PREFACE


The standard econometric modelling practice for quite a long time 
was founded on strong assumptions concerning the underlying data 
generating process. Based on these assumptions, estimation and 
hypothesis testing techniques were derived with known desirable, 
and in many cases optimal, properties. Frequently, these
assumptions were highly unrealistic and unlikely to be true. These
shortcomings were attributed to the simplification involved in any
modelling process and were therefore seen as inevitable and acceptable. The
crisis of econometric modelling in the seventies led to many well 
known new, sometimes revolutionary, developments in the way 
econometrics was undertaken. Unrealistically strong assumptions 
were no longer acceptable. Techniques and procedures able to deal 
with data and models within a more realistic framework were badly 
required. Just at the right time, i.e., the early eighties when all 
this became obvious, Lars Peter Hansen’s seminal paper on the 
asymptotic properties of the generalized method of moments (GMM)
estimator was published in Econometrica. Although the basic idea 
of the GMM can be traced back to the work of Denis Sargan in the 
late fifties, Hansen’s paper provided a ready to use, very flexible 
tool applicable to a large number of models, which relied on mild 
and plausible assumptions. The die was cast. Applications of the
GMM approach have since mushroomed in the literature, a growth
further boosted, like so many things, by the increased
availability of computing power.


Nowadays there are so many different theoretical and practical 
applications of the GMM principle that it is almost impossible to 
keep track of them. What started as a simple estimation method 
has grown to be a complete methodology for estimation and hypoth- 
esis testing. As most of the best known “traditional” estimation 
methods can be regarded as special cases of the GMM estimator, it 
can also serve as a nice unified framework for teaching estimation 
theory in econometrics. 


The main objective of this volume is to provide a complete and 
up-to-date presentation of the theory of GMM estimation, as well 
as insights into the use of these methods in empirical studies. 




The editor has tried to standardize the notation, language, 
exposition, and depth of the chapters in order to present a coherent 
book. We hope, however, that each chapter is able to stand on its 
own as a reference work.

de Paris XII in France and the Hungarian Research Fund (OTKA)
in Hungary for providing financial support; and the editors’ col- 
leagues and students for their valuable comments and feedback. 


The camera-ready copy of this volume was prepared by the


Chapter 1


INTRODUCTION TO THE
GENERALIZED METHOD OF
MOMENTS ESTIMATION


David Harris and László Mátyás


One of the most important tasks in econometrics and statistics is 
to find techniques enabling us to estimate, for a given data set, 
the unknown parameters of a specific model. Estimation procedures
based on the minimization (or maximization) of some kind of
criterion function (M-estimators) have successfully been used for
many different types of models. The main difference between these 
estimators lies in how much of the model must be specified. The most
widely applied such technique, maximum likelihood, requires the
complete specification of the model and its probability
distribution. The Generalized Method of Moments (GMM) does
not require this sort of full knowledge. It only demands the
specification of a set of moment conditions which the model should
satisfy.


In this chapter, first, we introduce the Method of Moments 
(MM) estimation, then generalize it and derive the GMM estima- 
tion procedure, and finally, analyze its properties. 




1.1 The Method of Moments 


The Method of Moments is an estimation technique which suggests 
that the unknown parameters should be estimated by matching 
population (or theoretical) moments (which are functions of the 
unknown parameters) with the appropriate sample moments. The 
first step is to define properly the moment conditions. 


1.1.1 Moment Conditions 


Definition 1.1 Moment Conditions


Suppose that we have an observed sample $\{x_t : t = 1, \ldots, T\}$
from which we want to estimate an unknown $p \times 1$
parameter vector $\theta$ with true value $\theta_0$. Let
$f(x_t, \theta)$ be a continuous $q \times 1$ vector function of
$\theta$, and let $E(f(x_t, \theta))$ exist and be finite for all
$t$ and $\theta$. Then the moment conditions are that
$E(f(x_t, \theta_0)) = 0$.


Before discussing the estimation of $\theta_0$, we give some examples
of moment conditions.


EXAMPLE 1.1 


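Consider a sample $\{x_t : t = 1, \ldots, T\}$ from a gamma
$\gamma(p^*, q^*)$ distribution with true values $p^* = p_0^*$ and
$q^* = q_0^*$. The relationships between the first two moments of
this distribution and its parameters are:

\[
E(x_t) = \frac{p_0^*}{q_0^*}, \qquad
E\left(x_t - E(x_t)\right)^2 = \frac{p_0^*}{q_0^{*2}}.
\]

In the notation of Definition 1.1, we have $\theta = (p^*, q^*)'$ and
a vector of $q = 2$ functions:

\[
f(x_t, \theta) = \left( x_t - \frac{p^*}{q^*}, \;
\left(x_t - \frac{p^*}{q^*}\right)^2 - \frac{p^*}{q^{*2}} \right)'.
\]

Then the moment conditions are $E(f(x_t, \theta_0)) = 0$.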


EXAMPLE 1.2 


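Another simple example can be given for a sample
$\{x_t : t = 1, \ldots, T\}$ from a beta $(p^*, q^*)$ distribution,
again with true values $p^* = p_0^*$ and $q^* = q_0^*$. The
relationships between the first two moments and the parameters of
this distribution are:

\[
E(x_t) = \frac{p_0^*}{p_0^* + q_0^*}, \qquad
E(x_t^2) = \frac{p_0^*(p_0^* + 1)}{(p_0^* + q_0^*)(p_0^* + q_0^* + 1)}.
\]

With $\theta = (p^*, q^*)'$ we have

\[
f(x_t, \theta) = \left( x_t - \frac{p^*}{p^* + q^*}, \;
x_t^2 - \frac{p^*(p^* + 1)}{(p^* + q^*)(p^* + q^* + 1)} \right)'.
\]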


EXAMPLE 1.3 


Consider the linear regression model

\[
y_t = x_t'\beta_0 + u_t,
\]

where $x_t$ is a $p \times 1$ vector of stochastic regressors,
$\beta_0$ is the true value of a $p \times 1$ vector of unknown
parameters $\beta$, and $u_t$ is an error term. In the presence of
stochastic regressors, we often specify

\[
E(u_t | x_t) = 0,
\]

so that

\[
E(y_t | x_t) = x_t'\beta_0.
\]

Using the Law of Iterated Expectations we find

\[
E(x_t u_t) = E\left(E(x_t u_t | x_t)\right)
           = E\left(x_t E(u_t | x_t)\right)
           = 0.
\]

The equations

\[
E(x_t u_t) = E\left(x_t (y_t - x_t'\beta_0)\right) = 0
\]

are moment conditions for this model. That is, in terms of
Definition 1.1, $\theta = \beta$ and
$f((x_t, y_t), \theta) = x_t (y_t - x_t'\beta)$.


Notice that in this example $E(x_t u_t) = 0$ consists of $p$ equations
since $x_t$ is a $p \times 1$ vector. Since $\beta$ is a $p \times 1$
parameter, these moment conditions exactly identify $\beta$. If we had
fewer than $p$ moment conditions, then we could not identify $\beta$,
and if we had more than $p$ moment conditions, then $\beta$ would be
over-identified. Estimation can proceed if the parameter vector is
exactly or over-identified. Notice that, compared to the maximum
likelihood approach (ML), we have specified relatively little
information about $u_t$. Using ML, we would be required to give the
distribution of $u_t$, as well as parameterising any autocorrelation
and heteroskedasticity, while this information is not required in
formulating the moment conditions. However, some restrictions on such
aspects of the model are still required for the derivation of the
asymptotic properties of the GMM estimator.


It is common to obtain moment conditions by requiring that 
the error term of the model has zero expectation conditional on cer- 
tain observed variables. Alternatively, we can specify the moment 
conditions directly by requiring the error term to be uncorrelated 
with certain observed instrumental variables. We illustrate with 
another example. 


EXAMPLE 1.4 


As in the previous example, we consider the linear regression
model

\[
y_t = x_t'\beta_0 + u_t,
\]

but we do not assume $E(u_t | x_t) = 0$. It follows that $u_t$ and
$x_t$ may be correlated. Suppose we have a set of instruments
in the $q \times 1$ vector $z_t$. These may be defined to be valid
instruments if $E(z_t u_t) = 0$. Thus the requirement that
$z_t$ be a set of valid instruments immediately provides the
appropriate moment conditions

\[
E(z_t u_t) = E\left(z_t (y_t - x_t'\beta_0)\right) = 0.
\]

That is,

\[
f((x_t, y_t, z_t), \theta) = z_t (y_t - x_t'\beta).
\]

There are $q$ equations here since there are $q$ instruments,
so we require $q \ge p$ for identification.


1.1.2 Method of Moments Estimation 


We now consider how to estimate a parameter vector $\theta$ using
moment conditions as given in Definition 1.1. Consider first the case
where $q = p$, that is, where $\theta$ is exactly identified by the
moment conditions. Then the moment conditions $E(f(x_t, \theta)) = 0$
represent a set of $p$ equations for $p$ unknowns. Solving these
equations would give the value of $\theta$ which satisfies the moment
conditions, and this would be the true value $\theta_0$. However, we
cannot observe $E(f(.,.))$, only $f(x_t, \theta)$. The obvious way to
proceed is to define the sample moments of $f(x_t, \theta)$

\[
f_T(\theta) = T^{-1} \sum_{t=1}^{T} f(x_t, \theta),
\]

which is the Method of Moments (MM) estimator of $E(f(x_t, \theta))$.
If the sample moments provide good estimates of the population
moments, then we might expect that the estimator $\hat\theta_T$ that
solves the sample moment conditions $f_T(\theta) = 0$ would provide a
good estimate of the true value $\theta_0$ that solves the population
moment conditions $E(f(x_t, \theta)) = 0$.


In each of Examples 1.1-1.3, the parameters are exactly iden- 
tified by the moment conditions. We consider each in turn. 


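EXAMPLE 1.5


For the gamma distribution in Example 1.1, $f_T(\hat\theta_T) = 0$
implies

\[
T^{-1} \sum_{t=1}^{T} x_t - \frac{\hat p_T^*}{\hat q_T^*} = 0
\]

and

\[
T^{-1} \sum_{t=1}^{T} \left(x_t - \frac{\hat p_T^*}{\hat q_T^*}\right)^2
  - \frac{\hat p_T^*}{\hat q_T^{*2}} = 0.
\]

Defining the sample mean $\bar x_T = T^{-1} \sum_{t=1}^{T} x_t$ and
sample variance $s_T^2 = T^{-1} \sum_{t=1}^{T} (x_t - \bar x_T)^2$,
we find

\[
\hat q_T^* = \frac{\bar x_T}{s_T^2}
\]

and

\[
\hat p_T^* = \frac{\bar x_T^2}{s_T^2}.
\]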

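EXAMPLE 1.6


For the beta distribution in Example 1.2, $f_T(\hat\theta_T) = 0$
implies

\[
T^{-1} \sum_{t=1}^{T} x_t - \frac{\hat p_T^*}{\hat p_T^* + \hat q_T^*} = 0
\]

and

\[
T^{-1} \sum_{t=1}^{T} x_t^2
  - \frac{\hat p_T^*(\hat p_T^* + 1)}
         {(\hat p_T^* + \hat q_T^*)(\hat p_T^* + \hat q_T^* + 1)} = 0.
\]

With the sample mean and variance defined as in Example
1.5, we solve these equations to obtain

\[
\hat p_T^* = \bar x_T \left( \frac{\bar x_T (1 - \bar x_T)}{s_T^2} - 1 \right)
\]

and

\[
\hat q_T^* = (1 - \bar x_T) \left( \frac{\bar x_T (1 - \bar x_T)}{s_T^2} - 1 \right).
\]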
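The beta case can be sketched in the same way. The function name `beta_mm` is hypothetical, and the code simply evaluates the two closed-form expressions of this example, with the sample variance again using the divisor $T$.

```python
def beta_mm(sample):
    """Method of Moments estimates (p_hat, q_hat) for a beta sample."""
    T = len(sample)
    mean = sum(sample) / T
    var = sum((x - mean) ** 2 for x in sample) / T
    # mean * (1 - mean) / var - 1 is the implied estimate of p + q.
    total = mean * (1.0 - mean) / var - 1.0
    return mean * total, (1.0 - mean) * total


if __name__ == "__main__":
    print(beta_mm([0.2, 0.4, 0.6, 0.8]))
```

Because the sample here is symmetric about one half, the two estimates coincide, as one would expect for a symmetric beta distribution.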


EXAMPLE 1.7 


For the linear regression model, the sample moment conditions
are

\[
T^{-1} \sum_{t=1}^{T} x_t \hat u_t
  = T^{-1} \sum_{t=1}^{T} x_t (y_t - x_t'\hat\beta_T) = 0,
\]

and solving for $\hat\beta_T$ gives

\[
\hat\beta_T = \left( \sum_{t=1}^{T} x_t x_t' \right)^{-1} \sum_{t=1}^{T} x_t y_t
            = (X'X)^{-1} X'y.
\]

That is, OLS is an MM estimator.
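To make the link between the sample moment conditions and OLS concrete, the sketch below solves them for a regression with an intercept and one regressor, so that $x_t = (1, w_t)'$. The regressor name `w`, the helper names and the data are illustrative assumptions, not taken from the text.

```python
def ols_mm(w, y):
    """Solve T^{-1} sum x_t (y_t - x_t'b) = 0 for b, with x_t = (1, w_t)'."""
    T = float(len(y))
    # Elements of X'X and X'y for the two-column design matrix.
    sw = sum(w)
    sww = sum(v * v for v in w)
    sy = sum(y)
    swy = sum(a * b for a, b in zip(w, y))
    det = T * sww - sw * sw
    b0 = (sww * sy - sw * swy) / det
    b1 = (T * swy - sw * sy) / det
    return b0, b1


def sample_moments(w, y, b):
    """Evaluate the two sample moments T^{-1} sum x_t u_t at coefficients b."""
    T = len(y)
    resid = [yt - b[0] - b[1] * wt for wt, yt in zip(w, y)]
    return sum(resid) / T, sum(wt * u for wt, u in zip(w, resid)) / T


if __name__ == "__main__":
    w = [0.0, 1.0, 2.0, 3.0]
    y = [1.0, 2.5, 5.5, 7.0]
    b = ols_mm(w, y)
    print(b, sample_moments(w, y, b))
```

At the solution both sample moments are zero up to rounding error, which is exactly the statement that the OLS residuals are orthogonal to the regressors.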


EXAMPLE 1.8 


The linear regression with $q = p$ instrumental variables is
also exactly identified. The sample moment conditions are

\[
T^{-1} \sum_{t=1}^{T} z_t \hat u_t
  = T^{-1} \sum_{t=1}^{T} z_t (y_t - x_t'\hat\beta_T) = 0,
\]

and solving for $\hat\beta_T$ gives

\[
\hat\beta_T = \left( \sum_{t=1}^{T} z_t x_t' \right)^{-1} \sum_{t=1}^{T} z_t y_t
            = (Z'X)^{-1} Z'y,
\]

which is the standard IV estimator.


EXAMPLE 1.9 


The Maximum Likelihood estimator can be given an MM
interpretation. If the log-likelihood for a single observation
is denoted $l(\theta | x_t)$, then the sample log-likelihood is
$T^{-1} \sum_{t=1}^{T} l(\theta | x_t)$. The first order conditions
for the maximization of the log-likelihood function are then

\[
T^{-1} \sum_{t=1}^{T}
  \left. \frac{\partial l(\theta | x_t)}{\partial \theta}
  \right|_{\theta = \hat\theta_T} = 0.
\]

These first order conditions can be regarded as a set of
sample moment conditions.


1.2 Generalized Method of Moments 
(GMM) Estimation 


The GMM estimator is used when the parameters $\theta$ are
over-identified by the moment conditions. In this case, the equations
$E(f(x_t, \theta_0)) = 0$ represent $q$ equations for $p$ unknowns
which are solved exactly by $\theta_0$. If we now proceed as in the
exactly identified case to obtain an estimator, we would have

\[
f_T(\hat\theta_T) = 0,
\]

which is also $q$ equations for $p$ unknowns. Since there are more
equations than unknowns, we cannot in general find a vector
$\hat\theta_T$ that satisfies $f_T(\theta) = 0$. Instead, we will find
the vector $\hat\theta_T$ that makes $f_T(\theta)$ as close to zero as
possible. This can be done by defining

\[
\hat\theta_T = \arg\min_{\theta} Q_T(\theta)
\]

where

\[
Q_T(\theta) = f_T(\theta)' A_T f_T(\theta),
\]

and $A_T$ is a stochastic positive definite $O_p(1)$ weighting matrix
whose role will be discussed later. Note that $Q_T(\theta) \ge 0$ and
$Q_T(\theta) = 0$ only if $f_T(\theta) = 0$. Thus, $Q_T(\theta)$ can be
made exactly zero in the just identified case, but is strictly
positive in the over-identified case. We summarize this in the
following definition.


Definition 1.2 GMM Estimator


Suppose we have an observed sample $\{x_t : t = 1, \ldots, T\}$
from which we want to estimate an unknown $p \times 1$ parameter
vector $\theta$ with true value $\theta_0$. Let
$E(f(x_t, \theta_0)) = 0$ be a set of $q$ moment conditions, and
$f_T(\theta)$ the corresponding sample moments. Define the criterion
function

\[
Q_T(\theta) = f_T(\theta)' A_T f_T(\theta),
\]

where $A_T$ is a stochastic $O_p(1)$ positive definite matrix.
Then the GMM estimator of $\theta$ is

\[
\hat\theta_T = \arg\min_{\theta} Q_T(\theta).
\]
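A small numerical sketch may help to fix ideas about the criterion function. The model used here is our own assumption and does not appear in the text: a scalar parameter $\theta$ with the two Poisson-type moment functions $x_t - \theta$ and $x_t^2 - \theta - \theta^2$, so that $q = 2 > p = 1$. The weighting matrix is the identity and the minimization is a crude grid search, chosen only for transparency.

```python
def sample_moments(theta, data):
    """f_T(theta) for the assumed moments E(x - theta) = 0 and
    E(x**2 - theta - theta**2) = 0."""
    T = len(data)
    m1 = sum(data) / T
    m2 = sum(x * x for x in data) / T
    return (m1 - theta, m2 - theta - theta ** 2)


def criterion(theta, data, A=((1.0, 0.0), (0.0, 1.0))):
    """Q_T(theta) = f_T(theta)' A f_T(theta) for a 2 x 2 weighting matrix A."""
    f = sample_moments(theta, data)
    return sum(f[i] * A[i][j] * f[j] for i in range(2) for j in range(2))


def gmm_grid(data, lo=0.001, hi=10.0, steps=10000):
    """Minimize Q_T over an evenly spaced grid on [lo, hi]."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda th: criterion(th, data))


if __name__ == "__main__":
    data = [0, 1, 1, 2, 3, 2, 1, 0, 4, 2]  # made-up illustrative sample
    theta_hat = gmm_grid(data)
    print(theta_hat, criterion(theta_hat, data))
```

In this sample the two moment conditions cannot both be set to zero, so the minimized criterion is strictly positive and the estimate lies between the values that each condition would deliver on its own.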


EXAMPLE 1.10 


For the linear regression model with $q > p$ valid instruments,
the moment conditions are

\[
E(z_t u_t) = E\left(z_t (y_t - x_t'\beta_0)\right) = 0,
\]

and the sample moments are

\[
f_T(\beta) = T^{-1} \sum_{t=1}^{T} z_t (y_t - x_t'\beta)
           = T^{-1} (Z'y - Z'X\beta).
\]

Suppose we choose

\[
A_T = \left( T^{-1} \sum_{t=1}^{T} z_t z_t' \right)^{-1} = T (Z'Z)^{-1},
\]

and assume we have a weak law of large numbers for $z_t z_t'$ so
that $T^{-1} Z'Z$ converges in probability to a constant matrix
$A$. Then the criterion function is

\[
Q_T(\beta) = T^{-1} (Z'y - Z'X\beta)' (Z'Z)^{-1} (Z'y - Z'X\beta).
\]

Differentiating with respect to $\beta$ gives the first order
conditions

\[
\left. \frac{\partial Q_T(\beta)}{\partial \beta} \right|_{\beta = \hat\beta_T}
  = -2 T^{-1} X'Z (Z'Z)^{-1} (Z'y - Z'X\hat\beta_T) = 0,
\]

and solving for $\hat\beta_T$ gives

\[
\hat\beta_T = \left( X'Z (Z'Z)^{-1} Z'X \right)^{-1} X'Z (Z'Z)^{-1} Z'y.
\]

This is the standard IV estimator for the case where there
are more instruments than regressors.
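The estimator of this example can be computed directly in the smallest over-identified case, $p = 1$ regressor and $q = 2$ instruments. The sketch below is illustrative only: the function name `gmm_iv` and the data are hypothetical, and the $2 \times 2$ inverse is written out by hand to keep the code free of external libraries.

```python
def gmm_iv(x, y, z1, z2):
    """beta_hat = (X'Z (Z'Z)^{-1} Z'X)^{-1} X'Z (Z'Z)^{-1} Z'y
    for one regressor x and two instruments z1, z2."""
    def dot(a, b):
        return sum(u * v for u, v in zip(a, b))

    # Elements of the symmetric 2 x 2 matrix Z'Z.
    a11, a12, a22 = dot(z1, z1), dot(z1, z2), dot(z2, z2)
    zx = (dot(z1, x), dot(z2, x))  # Z'X
    zy = (dot(z1, y), dot(z2, y))  # Z'y
    det = a11 * a22 - a12 * a12
    # w = (Z'Z)^{-1} Z'X, so that w' = X'Z (Z'Z)^{-1} by symmetry.
    w = ((a22 * zx[0] - a12 * zx[1]) / det,
         (a11 * zx[1] - a12 * zx[0]) / det)
    numerator = w[0] * zy[0] + w[1] * zy[1]    # X'Z (Z'Z)^{-1} Z'y
    denominator = w[0] * zx[0] + w[1] * zx[1]  # X'Z (Z'Z)^{-1} Z'X
    return numerator / denominator


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.9, 6.2, 8.8, 12.1, 15.3]
    z1 = [1.0, 1.0, 2.0, 3.0, 4.0]
    z2 = [2.0, 1.0, 0.0, 1.0, 3.0]
    print(gmm_iv(x, y, z1, z2))
```

The same number is obtained by regressing $y$ on the fitted values from a first-stage regression of $x$ on the instruments, which is the familiar two-stage least squares interpretation.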


1.3 Asymptotic Properties of the 
GMM Estimator 


We present here some sufficient conditions for the consistency and 
asymptotic normality of the GMM estimator, along with outlines of 
the proofs. We also consider the asymptotic efficiency of the GMM 
estimator for a given set of moment conditions.


1.3.1 Consistency


For consistency, we require some assumptions about the structure of
$f(x_t, \theta)$, $x_t$ and the parameter space. First, we need an
assumption to ensure that the true value of $\theta$ is identified.


Assumption 1.1


(i) $E(f(x_t, \theta))$ exists and is finite for all
$\theta \in \Theta$ and for all $t$.

(ii) If $g_t(\theta) = E(f(x_t, \theta))$, then there exists a
$\theta_0 \in \Theta$ such that $g_t(\theta) = 0$ for all $t$ if and
only if $\theta = \theta_0$.


Part (i) of Assumption 1.1 states that the population moments
exist, which implies some more primitive assumptions about $x_t$ and
$f(.,.)$. Part (ii) assumes that $\theta_0$ can be identified by the
population moments $g_t(\theta)$. That is, the population moments take
the value 0 at $\theta_0$ and at no other value of $\theta$. If there
were more than one value of $\theta$ such that $g_t(\theta) = 0$, then
even full knowledge of the population moment conditions would not
allow a true value of $\theta$ to be identified.


We consider the property of weak consistency of $\hat\theta_T$, that
is $\hat\theta_T \xrightarrow{p} \theta_0$. Consistency is derived from
a consideration of the asymptotic behavior of $Q_T(\theta)$. In
particular, we require an assumption about the convergence of the
sample moments $f_T(\theta)$ to the population moments
$\bar g_T(\theta) = T^{-1} \sum_{t=1}^{T} g_t(\theta)$. Note that
$f_T(\theta)$ and $\bar g_T(\theta)$ are $q$-vectors, and we refer to
their individual elements as $f_{Tj}(\theta)$ and
$\bar g_{Tj}(\theta)$ respectively, for $j = 1, \ldots, q$.


Assumption 1.2


$(f_{Tj}(\theta) - \bar g_{Tj}(\theta)) \xrightarrow{p} 0$ uniformly
for all $\theta \in \Theta$ and for all $j = 1, \ldots, q$.


This assumption requires that $f(x_t, \theta)$ satisfies a uniform
weak law of large numbers, so that the difference between the average
sample and population moments converges in probability to zero.
The uniformity of the convergence means that

\[
\sup_{\theta \in \Theta} | f_{Tj}(\theta) - \bar g_{Tj}(\theta) |
  \xrightarrow{p} 0, \quad \text{for } j = 1, \ldots, q.
\]

Note that this is a stronger requirement than pointwise convergence
in probability on $\Theta$, which simply requires that
$(f_{Tj}(\theta) - \bar g_{Tj}(\theta)) \xrightarrow{p} 0$ for all
$\theta \in \Theta$. The importance of the uniformity of the
convergence for the proof of consistency is that it implies that
$(f_{Tj}(\theta_T^*) - \bar g_{Tj}(\theta_T^*)) \xrightarrow{p} 0$,
where $\theta_T^*$ is some sequence. If we only have pointwise
convergence in probability then
$(f_{Tj}(\theta_T^*) - \bar g_{Tj}(\theta_T^*))$ may not converge to
zero. Amemiya [1985, p. 109] provides two such examples.


We also make an assumption about the convergence of the 
weighting matrix Ar. 


Assumption 1.3


There exists a non-random sequence of positive definite
matrices $\bar A_T$ such that

\[
A_T - \bar A_T \xrightarrow{p} 0.
\]


With these assumptions, the weak consistency of $\hat\theta_T$ can be
shown.


Theorem 1.1 Weak Consistency


Under Assumptions 1.1 to 1.3, the GMM estimator $\hat\theta_T$
defined in Definition 1.2 is weakly consistent.


PROOF OF THE THEOREM 


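The basic steps to prove the consistency of $\hat\theta_T$ are as
follows.

(i) Deduce from Assumptions 1.2 and 1.3 that there exists
a non-random sequence

\[
\bar Q_T(\theta) = \bar g_T(\theta)' \bar A_T \bar g_T(\theta)
\]

such that

\[
Q_T(\theta) - \bar Q_T(\theta) \xrightarrow{p} 0
  \quad \text{uniformly for } \theta \in \Theta. \tag{1-1}
\]

(ii) Deduce from Assumption 1.1 that $\bar Q_T(\theta) = 0$ if and
only if $\theta = \theta_0$, $\bar Q_T(\theta) > 0$ otherwise, and
hence

\[
\theta_0 = \arg\min_{\theta \in \Theta} \bar Q_T(\theta). \tag{1-2}
\]

(iii) Show that (1-2) and (1-1) imply that
$\hat\theta_T \xrightarrow{p} \theta_0$. That is, $\hat\theta_T$
minimizes $Q_T(\theta)$, $\theta_0$ minimizes $\bar Q_T(\theta)$ and
$(Q_T(\theta) - \bar Q_T(\theta)) \xrightarrow{p} 0$, and hence
$\hat\theta_T \xrightarrow{p} \theta_0$.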


Given our assumptions, the intuition of the proof is clear. A 
textbook presentation of the formal details of the proof can be 
found in Amemiya [1985, p. 107]. 


Assumptions 1.1 to 1.3 provide sufficient conditions for the
weak consistency of $\hat\theta_T$. However, Assumptions 1.2 and 1.3
are quite high level and not easily related to a given sample $x_t$
and function $f(.,.)$. We consider more primitive assumptions that are
sufficient for Assumptions 1.2 and 1.3 to hold.


There is a considerable literature devoted to the question of 
sufficient primitive conditions for Assumption 1.2. A thorough 
discussion of the technicalities of proving uniform laws of large 
numbers is beyond our scope, so we just give an outline of some 
of the important issues. One approach is to make use of the fact 
that the uniform convergence in probability in Assumption 1.2 can 
be shown to be a consequence of the following three more basic 
assumptions. 


Assumption 1.4


$\Theta$ is compact.


Assumption 1.5


$(f_{Tj}(\theta) - \bar g_{Tj}(\theta)) \xrightarrow{p} 0$ pointwise
on $\Theta$ for $j = 1, \ldots, q$.


Assumption 1.6


$f_{Tj}(\theta)$ is stochastically equicontinuous and
$\bar g_{Tj}(\theta)$ is equicontinuous for $j = 1, \ldots, q$.


Assumption 1.4 requires that the parameter space be closed and
bounded. From a practical point of view, this assumption is not
always desirable. For example, the parameter space for a variance
parameter (say $\sigma^2$) may consist of the positive real line
$\mathbb{R}^+$. That is, $\{\sigma^2 \in \mathbb{R} : \sigma^2 > 0\}$,
which is neither closed nor bounded. Even with some upper bound $B$
for $\sigma^2$, the set
$\{\sigma^2 \in \mathbb{R} : 0 < \sigma^2 \le B\}$ is only
semi-closed. Alternative assumptions and situations in which
Assumption 1.4 can be relaxed are considered in Hansen
[1982], Andrews [1992] and Potscher and Prucha [1994]. Hansen
[1982] discusses a particular structure for the moment conditions for
which compactness is not required. This structure includes linear
models such as in Examples 1.3 and 1.4. For our general results, we
maintain Assumption 1.4 for simplicity and consistency with much
of the existing literature.


Assumption 1.5 requires that $f_{Tj}(\theta)$ converges in probability
to $\bar g_{Tj}(\theta)$ for all $\theta \in \Theta$. That is, a weak
law of large numbers must hold for $f_j(x_t, \theta)$ for each
$\theta$, which requires further assumptions on $x_t$ and $f(.,.)$.
For example, if $x_t$ is i.i.d. and $E|f_j(x_t, \theta)| < \infty$
for all $\theta$ and $j$, then
$\bar g_{Tj}(\theta) = E f_j(x_t, \theta)$ is not a function of $T$
and Assumption 1.5 is the result of a standard i.i.d. weak law of
large numbers for each $\theta \in \Theta$. If the assumption that
$x_t$ is i.i.d. is relaxed, then other weak laws can be applied. If
$x_t$ is allowed to be heterogeneously distributed then additional
moment conditions must be imposed. As a simple example, if $x_t$ is
independently and heterogeneously distributed such that
$\sup_t E(f_j(x_t, \theta)^2) < \infty$ for all $j$ and $\theta$
(i.e., $f(x_t, \theta)$ has bounded variance for all $t$) then
Assumption 1.5 follows. Note that this is not the weakest assumption
available, but is easy to interpret. If $x_t$ is also allowed to
be dependent, then there are various sets of conditions that are
sufficient for Assumption 1.5. In general, it is necessary to control
the degree of dependence in $x_t$, and hence $f_j(x_t, \theta)$, so
that as two elements $x_t$ and $x_{t+k}$ become further apart (i.e.,
as $k$ increases) their dependence decreases towards zero. For
example, we may assume that $x_t$ is stationary and ergodic, or we may
assume that $x_t$ is mixing. Both ergodicity and mixing are concepts
of asymptotic independence. The use of a mixing assumption is common
because it allows $x_t$ to be both dependent and heterogeneous, and
also makes explicit the trade-off in the amount of dependence and
heterogeneity allowed for a weak law of large numbers to hold. If
$x_t$ is mixing and $f_j(x_t, \theta)$ is a function that contains at
most a finite number of lags of $x_t$, then $f_j(x_t, \theta)$ is also
mixing and satisfies a weak law of large numbers for mixing processes.
If $f_j(x_t, \theta)$ contains an infinite number of lags of $x_t$
(e.g., an autoregression), then it need not be mixing. In this case,
we can impose the additional assumption that $f_j(x_t, \theta)$ is
near-epoch dependent on $x_t$. This is an assumption which implies
that $f_j(x_t, \theta)$ is almost completely dependent on the most
recent lags of $x_t$ (or the near-epoch of $x_t$). For the example of
an autoregression, near-epoch dependence requires that the zeros of
the autoregressive polynomial lie outside the unit circle. If
$f_j(x_t, \theta)$ is near-epoch dependent on $x_t$, then it can be
shown that $f_j(x_t, \theta)$ is a mixingale for which weak laws
of large numbers can be derived. In all cases, the emphasis is
on controlling the degree of dependence of $f_j(x_t, \theta)$, so that
it is only weakly dependent on its distant past. Processes with strong
dependence such as unit root and many long memory processes are
excluded by such assumptions. Processes with deterministic trends
are also generally excluded, although Andrews and McDermott
[1995] provide asymptotic theory for this case.


The many technical details underlying these concepts are be- 
yond the scope of this introductory chapter. White [1984] provides 
a textbook treatment of limit theorems under various regularity 
conditions, as well as specific concepts such as stationarity (p. 40), 
ergodicity (p. 41) and mixing (p. 45). Davidson [1994] also discusses 
these concepts, along with mixingales (Chapter 16) and near-epoch 
dependence (Chapter 17). Gallant and White [1988] and Potscher 
and Prucha [1991a, b] both provide unified and rigorous expositions 
of the use of these ideas in the derivation of properties of estima- 
tors computed from the optimisation of some criterion function (of 
which GMM is an example). 


Assumption 1.6 (along with 1.4) provides sufficient regularity
to convert the pointwise convergence of Assumption 1.5 into the
uniform convergence of Assumption 1.2. The idea is that some
form of smoothness condition on $f_{Tj}(\theta)$ is required for
uniform convergence. Stochastic equicontinuity can be obtained by
making a stochastic Lipschitz-type assumption on $f_j(x_t, \theta)$.
That is, if there exists a uniformly bounded random sequence $b_t$
(i.e., $\sup_t E(b_t) < \infty$) and a non-negative deterministic
function $h(.)$ with $h(x) \to 0$ as $x \to 0$ such that

\[
| f_j(x_t, \theta_1) - f_j(x_t, \theta_2) |
  \le b_t \, h(\| \theta_1 - \theta_2 \|)
  \quad \text{for all } \theta_1, \theta_2 \in \Theta, \tag{1-3}
\]

then $f_{Tj}(\theta)$ can be shown to be stochastically equicontinuous
and $\bar g_{Tj}(\theta)$ can be shown to be equicontinuous (see
Corollary 3.1 of Newey [1991]). In Equation (1-3), $\|x\|$ denotes the
vector norm $\left( \sum_i x_i^2 \right)^{1/2}$. Intuitively, this
condition requires that if $\theta_1$ and $\theta_2$ are arbitrarily
close together, the difference between $f_j(x_t, \theta_1)$ and
$f_j(x_t, \theta_2)$ should also be arbitrarily small. In this sense,
it is a smoothness condition related to, but generally stronger than,
the continuity of $f_j(.,.)$ with respect to $\theta$. Stochastic
equicontinuity is a property of both $f(.,.)$ and $x_t$. There is a
considerable literature devoted to deriving sufficient conditions for
stochastic equicontinuity, including Andrews [1987, 1992], Potscher
and Prucha [1989a, 1994], Newey [1991] and Hansen [1996]. Davidson
[1994] (Chapter 21) provides a textbook treatment of uniform
convergence. The Lipschitz condition given here is only one example of
a sufficient condition.


Note that in order to show strong consistency of the GMM
estimator (as does Hansen [1982], for example), it is necessary to
strengthen Assumptions 1.2 and 1.3 by replacing convergence in
probability with convergence almost surely. This in turn requires
stronger primitive assumptions about $x_t$ and $f(.,.)$ than those
required for convergence in probability. For example, Davidson
[1994, p. 339] points out that the extra regularity required to obtain
strong stochastic equicontinuity can be considerable.


1.3.2 Asymptotic Normality 


Assumptions in addition to those required for consistency are
needed to derive an asymptotic normality result for the GMM
estimator.


Assumption 1.7


$f(x_t, \theta)$ is continuously differentiable with respect to
$\theta$ on $\Theta$.


That $f(x_t, \theta)$ is differentiable is required for the
minimization of $Q_T(\theta)$ to proceed via the solution of the first
order conditions $\partial Q_T(\hat\theta_T) / \partial\theta = 0$. It
follows from this assumption that $f_T(\theta)$ is continuously
differentiable, and we denote its first derivative as

\[
F_T(\theta) = \frac{\partial f_T(\theta)}{\partial \theta'}.
\]

Note that

\[
\frac{\partial f_T(\theta)}{\partial \theta'}
  = T^{-1} \sum_{t=1}^{T} \frac{\partial f(x_t, \theta)}{\partial \theta'},
\]

so the following assumption requires that a weak law of large numbers
applies to the first derivative of $f(x_t, \theta)$ in a neighbourhood
of $\theta_0$.


Assumption 1.8


For any sequence $\theta_T^*$ such that
$\theta_T^* \xrightarrow{p} \theta_0$,

\[
F_T(\theta_T^*) - \bar F_T \xrightarrow{p} 0,
\]

where $\bar F_T$ is a sequence of $q \times p$ matrices that do not
depend on $\theta$.


Assumption 1.9


$f(x_t, \theta_0)$ satisfies a central limit theorem, so that

\[
V_T^{-1/2} \sqrt{T} f_T(\theta_0) \xrightarrow{d} N(0, I_q),
\]

where $V_T$ is a sequence of $q \times q$ non-random positive definite
matrices defined as

\[
V_T = T \, \mathrm{var}\, f_T(\theta_0).
\]


Sufficient primitive conditions on $f(.,.)$ and $x_t$ can be found for
Assumptions 1.8 and 1.9. For example, Assumption 1.9 may follow
from a central limit theorem for independent and heterogeneously
distributed processes (e.g., White [1984, p. 112]), stationary and
ergodic processes (e.g., White [1984, p. 118]), mixing processes
(e.g., White [1984, p. 124]) or near-epoch dependent functions
of mixing processes (e.g., Davidson [1994, p. 386]). In general,
the weakest available conditions for such central limit theorems
are stronger than those for the corresponding weak laws of large
numbers referred to in relation to Assumption 1.5. Thus, given the
way in which such assumptions are presented, we make relatively
weak assumptions about $f(x_t, \theta)$ for all $\theta \in \Theta$ to
obtain uniform convergence in probability, and relatively strong
assumptions about $f(x_t, \theta_0)$ to obtain the central limit
theorem. It is not necessarily the case that a central limit theorem
need apply to $f(x_t, \theta)$ for $\theta \ne \theta_0$.


We now give an asymptotic normality result for the GMM estimator.
The method of proof involves an expansion of the moment
conditions about the true value $\theta_0$, which is a standard
technique in showing asymptotic normality. Our steps follow those of
Hansen [1982].


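Theorem 1.2 Asymptotic Normality


Under Assumptions 1.1 to 1.3, and 1.7 to 1.9 the GMM
estimator $\hat\theta_T$ has the asymptotic distribution

\[
\left( F_T(\hat\theta_T)' A_T V_T A_T F_T(\hat\theta_T) \right)^{-1/2}
\left( F_T(\hat\theta_T)' A_T F_T(\hat\theta_T) \right)
\sqrt{T} (\hat\theta_T - \theta_0) \xrightarrow{d} N(0, I_p).
\]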


PROOF OF THE THEOREM 


Consider a mean-value expansion of the vector of moment
conditions around $\theta_0$:

\[
f_T(\hat\theta_T) = f_T(\theta_0) + F_T(\theta_T^*) (\hat\theta_T - \theta_0),
\]

where $\theta_T^*$ lies between $\hat\theta_T$ and $\theta_0$. Note
that $\theta_T^* \xrightarrow{p} \theta_0$ because $\hat\theta_T$ is
consistent. If we pre-multiply this equation by
$F_T(\hat\theta_T)' A_T$ we get

\[
F_T(\hat\theta_T)' A_T f_T(\hat\theta_T)
  = F_T(\hat\theta_T)' A_T f_T(\theta_0)
  + F_T(\hat\theta_T)' A_T F_T(\theta_T^*) (\hat\theta_T - \theta_0) = 0,
\]

since the left hand side represents the first order conditions
for the minimization of $Q_T(\theta)$. Re-arranging and multiplying
by $\sqrt{T}$ gives

\[
\left( F_T(\hat\theta_T)' A_T V_T A_T F_T(\hat\theta_T) \right)^{-1/2}
\left( F_T(\hat\theta_T)' A_T F_T(\theta_T^*) \right)
\sqrt{T} (\hat\theta_T - \theta_0)
\]
\[
  = - \left( F_T(\hat\theta_T)' A_T V_T A_T F_T(\hat\theta_T) \right)^{-1/2}
    F_T(\hat\theta_T)' A_T V_T^{1/2}
    \times V_T^{-1/2} \sqrt{T} f_T(\theta_0).
\]

In view of Assumption 1.9, the right hand side of this
expression is asymptotically distributed as $N(0, I_p)$. On
the left hand side, we can use Assumption 1.8 to replace
$F_T(\theta_T^*)$ with $F_T(\hat\theta_T)$ with no change to the
asymptotics.


The asymptotic distribution result in Theorem 1.2 is not operational
for the derivation of test statistics because $V_T$ is unknown
and must be estimated. There is a considerable literature on the
consistent estimation of $V_T$, see Andrews [1991], Newey and West
[1994] and den Haan and Levin [1996] for a survey. See also
Chapter 3. For further reference, we simply assume that such a
consistent estimator can be obtained.


Assumption 1.10


We can calculate $\hat V_T$ such that
$\hat V_T - V_T \xrightarrow{p} 0$.


Since $V_T = T \, \mathrm{var}\, f_T(\theta_0)$ and $\theta_0$ is
unknown, the consistent estimator $\hat V_T$ is calculated using
$f_T(\hat\theta_T)$.


1.3.3 Asymptotic Efficiency 


We consider the asymptotic efficiency of the GMM estimator from the
perspective of choosing the appropriate weighting matrix $A_T$.
Following Definition 4.20 of White [1984], Theorem 1.2 implies that
the asymptotic covariance matrix of the GMM estimator for a given
set of moment conditions is

\[
\left( \bar F_T' A_T \bar F_T \right)^{-1}
\bar F_T' A_T V_T A_T \bar F_T
\left( \bar F_T' A_T \bar F_T \right)^{-1}.
\]

The weighting matrix $A_T$ can be seen to have an effect only on
the variance of the asymptotic distribution of the GMM estimator,
and not on the consistency. We therefore choose $A_T$ to minimize
the asymptotic covariance matrix. The appropriate choice of $A_T$ is
suggested by the following lemma.


Lemma 1.1


\[
\left( \bar F_T' A_T \bar F_T \right)^{-1}
\bar F_T' A_T V_T A_T \bar F_T
\left( \bar F_T' A_T \bar F_T \right)^{-1}
- \left( \bar F_T' V_T^{-1} \bar F_T \right)^{-1}
\]

is positive semi-definite for all $A_T$.
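The inequality in Lemma 1.1 can be checked numerically in the simplest over-identified case, $p = 1$ and $q = 2$, where both covariance expressions are scalars. The sketch below is only an illustration under assumed inputs: the derivative vector `F`, the matrix `V` and the helper names are all hypothetical.

```python
def matmul2(A, B):
    """Product of two 2 x 2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def inv2(M):
    """Inverse of a non-singular 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def quad(u, M, v):
    """Bilinear form u' M v for 2-vectors u, v."""
    return sum(u[i] * M[i][j] * v[j] for i in range(2) for j in range(2))


def sandwich_variance(F, A, V):
    """(F'AF)^{-1} F'AVAF (F'AF)^{-1} when F is 2 x 1 (so p = 1)."""
    AVA = matmul2(matmul2(A, V), A)
    return quad(F, AVA, F) / quad(F, A, F) ** 2


def efficient_variance(F, V):
    """(F'V^{-1}F)^{-1} when F is 2 x 1."""
    return 1.0 / quad(F, inv2(V), F)


if __name__ == "__main__":
    F = (1.0, 2.0)
    V = ((2.0, 0.5), (0.5, 1.0))
    identity = ((1.0, 0.0), (0.0, 1.0))
    print(sandwich_variance(F, identity, V), efficient_variance(F, V))
```

For these inputs the identity weighting gives 0.32 while the bound is 0.25, and setting the weighting matrix equal to the inverse of `V` attains the bound.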


This suggests that we should choose $A_T = V_T^{-1}$. Then we can
deduce from Theorem 1.2 that the asymptotic covariance matrix of
this GMM estimator is $(\bar F_T' V_T^{-1} \bar F_T)^{-1}$, which is
the smallest possible for the given moment conditions. This choice of
$A_T$ requires a two step estimation procedure because consistent
estimation of $V_T$ requires a consistent estimator of $\theta_0$. An
initial consistent estimator of $\theta_0$ can be calculated for some
arbitrary choice of $A_T$ (say $A_T = I_q$), which can then be used to
calculate $\hat V_T$ and the asymptotically
efficient GMM estimator.
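The two step procedure can be sketched as follows. Everything model-specific here is an assumption made for illustration and is not taken from the text: a scalar $\theta$ with the Poisson-type moment functions $x_t - \theta$ and $x_t^2 - \theta - \theta^2$, i.i.d. data so that $\hat V_T$ is the average outer product of the moment functions, and a simple grid search in place of a proper optimizer.

```python
def moments(x, theta):
    """Assumed moment functions f(x_t, theta) (illustrative Poisson-type model)."""
    return (x - theta, x * x - theta - theta ** 2)


def sample_moments(data, theta):
    """f_T(theta): the sample average of the moment functions."""
    T = len(data)
    f = [moments(x, theta) for x in data]
    return (sum(v[0] for v in f) / T, sum(v[1] for v in f) / T)


def inv2(M):
    """Inverse of a non-singular 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))


def criterion(data, theta, A):
    """Q_T(theta) = f_T(theta)' A f_T(theta)."""
    f = sample_moments(data, theta)
    return sum(f[i] * A[i][j] * f[j] for i in range(2) for j in range(2))


def minimise(data, A, lo=0.001, hi=10.0, steps=10000):
    """Grid-search minimizer of the criterion on [lo, hi]."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(grid, key=lambda th: criterion(data, th, A))


def v_hat(data, theta):
    """T^{-1} sum f(x_t, theta) f(x_t, theta)' (i.i.d. case)."""
    T = len(data)
    f = [moments(x, theta) for x in data]
    return tuple(
        tuple(sum(v[i] * v[j] for v in f) / T for j in range(2))
        for i in range(2)
    )


def two_step_gmm(data):
    """Step 1 uses the identity weight; step 2 uses the inverse of V_hat."""
    identity = ((1.0, 0.0), (0.0, 1.0))
    theta1 = minimise(data, identity)
    A = inv2(v_hat(data, theta1))
    theta2 = minimise(data, A)
    return theta1, theta2, A


if __name__ == "__main__":
    data = [0, 1, 1, 2, 3, 2, 1, 0, 4, 2]  # made-up illustrative sample
    print(two_step_gmm(data))
```

Changing the first-step weighting matrix would generally change the second-step estimate in a finite sample, which is the non-uniqueness issue discussed next.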


A disadvantage of this two step approach is that the resulting
asymptotically efficient GMM estimator is not unique: it depends on
which arbitrary weighting matrix was used to calculate the initial
estimator. That is, different estimators are obtained depending on
the initial choice of $A_T$. Although there is no difference
asymptotically, this is still an undesirable feature. There are some
alternative
estimators that have been proposed to avoid this problem. Hansen, 
Heaton and Yaron [1996] suggested two approaches. The first is 
to iterate the estimation procedure until the parameter estimates 
and the weighting matrix converge. The second is to acknowledge 
the dependence of the optimal weighting matrix on the parameter 
vector and minimize 


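\[
Q_T(\theta) = f_T(\theta)' A_T(\theta) f_T(\theta)
\]

with respect to $\theta$. For the i.i.d. case

\[
A_T(\theta) = \left( T^{-1} \sum_{t=1}^{T} f(x_t, \theta) f(x_t, \theta)' \right)^{-1},
\]

while if dependence and heterogeneity are allowed in $x_t$ then more
complicated forms of $A_T(\theta)$ are necessary, see Chapter 3. Also,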
Imbens [1997] and Smith [1997] provide developments of GMM es- 
timation based on non-parametric approximations to the likelihood 
function. Each provides asymptotically efficient estimation without 
being dependent on an initial choice of weighting matrix. 


Note that this discussion applies only to the over-identified
case where there are more moment conditions than parameters ($q > p$).
If $p = q$ (i.e., $\theta$ is exactly identified) then no choice of $A_T$ is
required since $\hat\theta_T$ can be calculated by solving the sample moment
conditions $f_T(\hat\theta_T) = 0$ exactly. If estimation were to proceed by
the minimization of $Q_T(\theta)$, the above expression for the asymptotic
covariance matrix reduces to $F_T^{-1} V_T (F_T')^{-1}$ for any choice of $A_T$.
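
The reduction in the exactly identified case can be checked directly. The following short derivation is a sketch that assumes $F$ is square ($p = q$) and nonsingular and that $A$ is positive definite, written in terms of the limits $F$, $A$ and $V$:

```latex
\begin{aligned}
(F'AF)^{-1} &= F^{-1} A^{-1} (F')^{-1}, \\
(F'AF)^{-1} F'AVAF (F'AF)^{-1}
  &= F^{-1} A^{-1} (F')^{-1} F' A \, V \, A F \, F^{-1} A^{-1} (F')^{-1} \\
  &= F^{-1} V (F')^{-1}.
\end{aligned}
```

The weighting matrix cancels, so the limiting distribution does not depend on $A$.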


1.3.4 Illustrations 


In order to illustrate the concepts of the previous sections, we 
consider two examples. The first shows how things simplify when 
i.i.d. observations are assumed, and the second considers again the 
instrumental variables estimation of a linear regression.


1.3.4.1 GMM With i.i.d. Observations


If $x_t$ is assumed to be i.i.d., $f(x_t, \theta)$ is also i.i.d. and
$E(f(x_t, \theta)) = g_t(\theta) = g(\theta)$ for all $t$. In this case, Assumption 1.1
requires that the expectation exists and $g(\theta) = 0$ if and only if
$\theta = \theta_0$. Then $g_T(\theta)$ also simplifies to $g(\theta)$, so that Assumption 1.2
requires that $f_{Tj}(\theta) \xrightarrow{p} g_j(\theta)$ uniformly on $\Theta$ for all $j$. We can also
replace Assumption 1.3 with the simplified statement $A_T \xrightarrow{p} A$,
where $A$ is a non-stochastic positive definite matrix. Then the
proof of consistency given in Theorem 1.1 involves us showing that
$Q_T(\theta) \xrightarrow{p} Q(\theta)$ uniformly on $\Theta$, where $Q(\theta) = g(\theta)' A g(\theta)$. It
follows that $\theta_0 = \arg\min_{\theta \in \Theta} Q(\theta)$, from which consistency can be
shown.

With respect to asymptotic normality, Assumption 1.8 requires
that the first derivative of the moments satisfies


$$F_T(\theta_T^*) = \left. \frac{\partial f_T(\theta)}{\partial \theta'} \right|_{\theta = \theta_T^*} \xrightarrow{p} F$$


for any sequence $\theta_T^*$ such that $\theta_T^* \xrightarrow{p} \theta_0$, where $F$ is a non-stochastic
$q \times p$ matrix of full column rank. Since $f(x_t, \theta)$ is i.i.d.,
the central limit theorem of Assumption 1.9 can be expressed as


$$\sqrt{T} f_T(\theta_0) \xrightarrow{d} N(0, V),$$


where $V = \mathrm{var}(f(x_t, \theta_0))$. If, as in the proof of Theorem 1.2, we
perform a mean-value expansion of $f_T(\hat\theta_T)$ about $\theta_0$, pre-multiply
by $F_T(\hat\theta_T)' A_T$ and re-arrange, we obtain


$$\sqrt{T}(\hat\theta_T - \theta_0) = -\left( F_T(\hat\theta_T)' A_T F_T(\theta_T^*) \right)^{-1} F_T(\hat\theta_T)' A_T \sqrt{T} f_T(\theta_0) \xrightarrow{d} N\left(0, (F'AF)^{-1} F'AVAF (F'AF)^{-1}\right).$$


The asymptotic covariance matrix of $\hat\theta_T$ is therefore


$$(F'AF)^{-1} F'AVAF (F'AF)^{-1},$$


which in view of Lemma 1.1 is minimized if we can choose $A_T$
such that $A = V^{-1}$. Then the asymptotic covariance matrix would
reduce to $(F'V^{-1}F)^{-1}$. We therefore require a consistent estimator
of $V$. Since $f(x_t, \theta)$ is i.i.d., a consistent estimator, if we knew $\theta_0$,
would be


$$V_T = T^{-1} \sum_{t=1}^{T} f(x_t, \theta_0) f(x_t, \theta_0)'.$$


Since $\theta_0$ is unknown, we can calculate an initial consistent (but
asymptotically inefficient) estimator $\hat\theta_T^{(1)}$ using $A_T = I_q$ (or some
other known matrix), and calculate the estimator $\hat\theta_T$ using
$A_T = \hat V_T^{-1}$ where


$$\hat V_T = T^{-1} \sum_{t=1}^{T} f(x_t, \hat\theta_T^{(1)}) f(x_t, \hat\theta_T^{(1)})'.$$


For a given set of moment conditions, $\hat\theta_T$ is asymptotically the most
efficient GMM estimator.
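
As a rough illustration of the two steps, consider a deliberately simple i.i.d. setting that is not taken from the text: two noisy measurements of the same mean $\theta_0$, so that $f(x_t, \theta) = x_t - \theta\iota$ with $\iota = (1, 1)'$ and $F = -\iota$. Because $f$ is linear in $\theta$, each minimization has a closed form. The variances, sample size and helper name `gmm_mean` are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
T, theta0 = 5000, 1.0
# Two measurements of the same mean with variances 1 and 9 (illustrative).
x = np.column_stack([theta0 + rng.normal(0.0, 1.0, T),
                     theta0 + rng.normal(0.0, 3.0, T)])
iota = np.ones(2)

def gmm_mean(x, A):
    """argmin over theta of (xbar - theta*iota)' A (xbar - theta*iota)."""
    xbar = x.mean(axis=0)
    return float(iota @ A @ xbar / (iota @ A @ iota))

# Step 1: A_T = I gives a consistent but inefficient estimator.
theta_step1 = gmm_mean(x, np.eye(2))

# Estimate V from the first-step moments f(x_t, theta) = x_t - theta*iota.
f1 = x - theta_step1
V_hat = f1.T @ f1 / T

# Step 2: A_T = V_hat^{-1} gives the asymptotically efficient estimator.
A_opt = np.linalg.inv(V_hat)
theta_step2 = gmm_mean(x, A_opt)

# Estimated asymptotic variances of sqrt(T)(theta_hat - theta0), with F = -iota.
avar_step1 = float(iota @ V_hat @ iota) / float(iota @ iota) ** 2   # (F'F)^-1 F'VF (F'F)^-1
avar_step2 = 1.0 / float(iota @ A_opt @ iota)                       # (F'V^-1 F)^-1
```

In this toy case the second step amounts to weighting each measurement by its estimated precision.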


1.3.4.2 Regression With Instrumental Variables 


We can use the example of a regression estimated using instrumen- 
tal variables to illustrate the preceding ideas. In particular, we can 
show how the properties of the estimator can be derived working 
only with the moment conditions and criterion function given in 
Example 1.10, without explicitly writing down the estimator. For 
the linear regression model with q > p instruments considered in 
Example 1.10, we have the sample moments 


$$f((x_t, y_t, z_t), \beta) = z_t (y_t - x_t'\beta) = z_t (u_t - x_t'(\beta - \beta_0)).$$


Assuming that $(x_t, z_t, u_t)$ is i.i.d., we have


$$f_T(\beta) = T^{-1} \sum_{t=1}^{T} z_t u_t - T^{-1} \sum_{t=1}^{T} z_t x_t' (\beta - \beta_0) \xrightarrow{p} -\Sigma_{zx} (\beta - \beta_0) = g(\beta),$$


where the admissibility of the instruments guarantees that


$$T^{-1} \sum_{t=1}^{T} z_t u_t \xrightarrow{p} E(z_t u_t) = 0,$$


and


$$T^{-1} \sum_{t=1}^{T} z_t x_t' \xrightarrow{p} E(z_t x_t') = \Sigma_{zx},$$


where $\Sigma_{zx}$ has full column rank. Note that the admissibility of the
instruments also guarantees that $g(\beta) = 0$ if and only if $\beta = \beta_0$,
thus identifying $\beta_0$. Furthermore,


$$Q_T(\beta) = f_T(\beta)' A_T f_T(\beta) \xrightarrow{p} g(\beta)' A g(\beta) = Q(\beta)$$


for a suitable choice of $A_T$ such that $A_T \xrightarrow{p} A$, and hence
$\beta_0 = \arg\min_{\beta} Q(\beta)$. Therefore the consistency of the IV estimator can
be shown without writing down the closed form of the estimator.


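With respect to asymptotic normality, we have


$$\frac{\partial f_T(\beta)}{\partial \beta'} = -T^{-1} \sum_{t=1}^{T} z_t x_t' \xrightarrow{p} -\Sigma_{zx} = F,$$


and


$$\mathrm{var}\, f((x_t, y_t, z_t), \beta_0) = \mathrm{var}(z_t u_t) = \sigma_u^2 \Sigma_{zz} = V,$$


where $\sigma_u^2 = E(u_t^2 \mid z_t)$ and $\Sigma_{zz} = E(z_t z_t')$. Thus for any $A_T$ we have


$$\sqrt{T}(\hat\beta_T - \beta_0) \xrightarrow{d} N\left(0, \sigma_u^2 (\Sigma_{xz} A \Sigma_{zx})^{-1} \Sigma_{xz} A \Sigma_{zz} A \Sigma_{zx} (\Sigma_{xz} A \Sigma_{zx})^{-1}\right).$$


If, as suggested in Example 1.10, we choose


$$A_T = \left( T^{-1} \sum_{t=1}^{T} z_t z_t' \right)^{-1}$$


so that $A_T \xrightarrow{p} A = \Sigma_{zz}^{-1}$, the asymptotic distribution of $\hat\beta_T$ becomes


$$\sqrt{T}(\hat\beta_T - \beta_0) \xrightarrow{d} N\left(0, \sigma_u^2 (\Sigma_{xz} \Sigma_{zz}^{-1} \Sigma_{zx})^{-1}\right).$$


In view of Lemma 1.1, this is the asymptotically most efficient
instrumental variables estimator using $z_t$.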
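
The closed form that the argument above avoids writing down is easy to compute, and doing so shows that the weighting matrix $A_T = (T^{-1} \sum_t z_t z_t')^{-1}$ reproduces the familiar two stage least squares estimator. The sketch below uses simulated data; the design (three instruments, two regressors, coefficient values) and the helper name `linear_gmm` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
T, beta0 = 1000, np.array([1.0, -0.5])
Z = rng.normal(size=(T, 3))                      # q = 3 instruments
v = rng.normal(size=T)
u = 0.8 * v + rng.normal(size=T)                 # error correlated with the first regressor
X = np.column_stack([Z[:, 0] + Z[:, 1] + v,      # endogenous regressor
                     Z[:, 2] + rng.normal(size=T)])
y = X @ beta0 + u

def linear_gmm(y, X, Z, A):
    """Minimiser of f_T(b)' A f_T(b) with f_T(b) = T^{-1} Z'(y - Xb)."""
    n = len(y)
    Szx, Szy = Z.T @ X / n, Z.T @ y / n
    return np.linalg.solve(Szx.T @ A @ Szx, Szx.T @ A @ Szy)

A_T = np.linalg.inv(Z.T @ Z / T)
beta_gmm = linear_gmm(y, X, Z, A_T)

# Two stage least squares via an explicit first-stage projection.
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta_2sls = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
```

The two estimates agree up to rounding error, since both equal $(X'P_Z X)^{-1} X'P_Z y$ with $P_Z$ the projection onto the instruments.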


1.3.5 Choice of Moment Conditions 


It is also important to consider unconditional asymptotic efficiency, 
that is, which moment conditions provide the most asymptotically 
efficient GMM estimator. One can argue that the more orthogo- 
nality conditions are used (the larger the information set) the more 
asymptotically efficient the GMM estimator is. It is true in general
that including extra moment conditions will not reduce asymptotic
efficiency. However, we now give two examples that caution against 
the indiscriminate use of large numbers of moment conditions. 


EXAMPLE 1.11 


Consider again the example of regression with instrumental
variables given in Section 1.3.4.2. Suppose for simplicity
that $x_t$ is a single stochastic regressor and $z_t$ is a single
instrumental variable such that $(x_t, z_t, u_t)$ is zero mean
and i.i.d. and $E(u_t \mid z_t) = 0$. From this conditional expectation
we can derive an infinite number of moment
conditions $E(u_t h(z_t)) = 0$ where $h(\cdot)$ is any function of $z_t$.
For example, $E(u_t z_t) = 0$ and $E(u_t z_t^2) = 0$, so that $z_t$ and $z_t^2$
may be admissible instruments for $x_t$. Following the notation of 1.3.4.2,
the IV estimator using these instruments and the efficient
weighting matrix has asymptotic variance $\sigma_u^2 (\Sigma_{xz} \Sigma_{zz}^{-1} \Sigma_{zx})^{-1}$,
where


$$\Sigma_{zx} = \begin{pmatrix} E(x_t z_t) \\ E(x_t z_t^2) \end{pmatrix}$$


and


$$\Sigma_{zz} = \begin{pmatrix} E(z_t^2) & E(z_t^3) \\ E(z_t^3) & E(z_t^4) \end{pmatrix}.$$


However, it is important that in addition to $E(u_t z_t^2) = 0$
we know that $E(x_t z_t^2) \neq 0$. That $x_t$ and $z_t$ are highly
correlated does not imply that $x_t$ and $z_t^2$ are correlated. If
$E(x_t z_t^2) = 0$ then


$$(\Sigma_{xz} \Sigma_{zz}^{-1} \Sigma_{zx})^{-1} = \frac{E(z_t^2) - E(z_t^3)^2 / E(z_t^4)}{E(x_t z_t)^2}.$$


Thus, if $E(x_t z_t^2) = 0$ and $z_t$ is symmetrically distributed
then there is no improvement in asymptotic efficiency by
including $z_t^2$ as an additional instrument.
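
A quick numerical check of this example is sketched below. The moment values are illustrative assumptions (a standard normal $z_t$, so $E(z_t^2) = 1$, $E(z_t^3) = 0$, $E(z_t^4) = 3$, with $E(x_t z_t) = 0.6$), and $\sigma_u^2$ is set to one.

```python
import numpy as np

def iv_avar(S_zx, S_zz, sigma2=1.0):
    """sigma_u^2 * (S_xz S_zz^{-1} S_zx)^{-1} for a single regressor."""
    S_zx = np.atleast_1d(np.asarray(S_zx, dtype=float))
    S_zz = np.atleast_2d(np.asarray(S_zz, dtype=float))
    return sigma2 / float(S_zx @ np.linalg.solve(S_zz, S_zx))

Ez2, Ez3, Ez4, Exz = 1.0, 0.0, 3.0, 0.6   # symmetric z_t, E(x z^2) = 0

avar_z = iv_avar([Exz], [[Ez2]])                                  # instrument z only
avar_z_z2 = iv_avar([Exz, 0.0], [[Ez2, Ez3], [Ez3, Ez4]])         # instruments z and z^2

# Same setting but with a skewed z_t (hypothetical E(z^3) = 1).
avar_skewed = iv_avar([Exz, 0.0], [[Ez2, 1.0], [1.0, Ez4]])
```

With a symmetric $z_t$ the two variances coincide. In the skewed case the second instrument does reduce the variance even though $E(x_t z_t^2) = 0$, which agrees with the formula above through the term $E(z_t^3)^2 / E(z_t^4)$.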




The previous example illustrates that additional moment conditions 
will only improve asymptotic efficiency if they contribute extra 
information to that contained in the existing moment conditions. 
The following example further illustrates this point using a panel 
data model.


EXAMPLE 1.12


For the dynamic error components panel data model (Sevestre
and Trognon [1995])

$$y_{it} = \delta y_{it-1} + x_{it}'\beta + u_{it}, \qquad u_{it} = \mu_i + v_{it},$$
$$i = 1, \ldots, N, \qquad t = 1, \ldots, T,$$
$$E(\mu_i) = E(v_{it}) = 0, \quad E(\mu_i^2) = \sigma_\mu^2, \quad E(v_{it}^2) = \sigma_v^2, \quad E(\mu_i v_{it}) = 0,$$

the "usual" OLS, (F)GLS, Within, Between, etc. estimators
are inconsistent when $N \to \infty$ (but $T$ fixed). The
GMM technique now consists of estimating this model by
using the
E(yio — Zio) =0 
El(yio — ig)? = oF, 
El(yio — Lob) (Yi — SY- — rp) =O Y t=1,...,T 
Ey = OYit—1 aad zab) =0 V t= Lins . pe 
El (yi — Syite-1 — £48) (Yis — SYis-1 — 2,8)] = 0 
t,s=1,...,T t#s 
El(yat — Syit—1 — 248) (Yee — Syie-1 — Lab) = o, +o; 
VEST cag 
Elyo(yie — 6Yu-1 ~ rap) =e V t=1,...,T 
El(yo — 2,6)2=]=0 Y t=1,...,T k=1,...,r 
Eliya —2,0)2%|)=0 Y t=1,...,T k=1,...,7 
El(yie — Yit- — a',8) 2%] =0 
Vtjs=1,...,.T s¢t k=1,...,r 
El(vie — Sya- — 2,0)e%]=0 Y t=1,...,T 
k= 2.5357 
orthogonality conditions. In this case the GMM estimator
has no closed analytical form. Also, the number of moment
conditions is proportional to the time dimension $T$.
As $N \to \infty$ these moment conditions become perfectly
collinear so that some are redundant in the sense they
provide no extra information.


EXAMPLE 1.13


Hansen and Singleton [1982] considered the estimation of
a consumption based capital asset pricing model. This
is a non-linear rational expectations model where agents
maximize


$$E_0 \left( \sum_{t=0}^{\infty} \beta^t U(C_t) \right),$$


subject to the budget constraint


$$C_t + P_t Q_t \leq V_t Q_{t-1} + W_t.$$


In this model, $C_t$ is real consumption in period $t$, $W_t$ is
real labor income, $\beta$ is the discount factor and $U(C_t)$
is a strictly concave utility function. In this simplified
version of the model, there are bonds of one period maturity
available at price $P_t$ and with payoff $V_{t+1}$. Agents
hold a quantity $Q_t$ of these bonds at each time period.
If $U(C_t) = (C_t^{\alpha} - 1)/\alpha$ then the first order condition for
utility maximization is


$$E_t \left( \beta \left( \frac{C_{t+1}}{C_t} \right)^{\alpha - 1} R_{t+1} - 1 \right) = 0,$$


where $R_{t+1} = V_{t+1}/P_t$ is the return on a bond held from
time $t$ to $t+1$. This first order condition then gives moment
conditions of the form


$$E \left( \left( \beta \left( \frac{C_{t+1}}{C_t} \right)^{\alpha - 1} R_{t+1} - 1 \right) z_t \right) = 0,$$


where $z_t$ is any vector of variables drawn from the information
set available at time $t$. This may include lags of
$C_{t+1}/C_t$ and $R_{t+1}$. This type of application was important
in the development of GMM in econometrics. It allowed 
the estimation of the parameters of economic models that 
do not have explicit solutions to their first order conditions. 
Furthermore, it did so without the imposition of specific 
statistical restrictions such as distributional assumptions 
for the generation of the data.


1.4 Conclusion


In this chapter we introduced the basic ideas behind the Gener- 
alized Method of Moments estimator and, also, derived its main 
properties. With the help of several examples, we showed how this 
approach relates to other estimation techniques used in economet- 
rics and statistics. 


Most of the methods and procedures presented in Chapter 1 
are studied in more depth in the following chapters of the volume. 
The next chapter analyzes in detail the GMM estimation of various 
econometric models using several different assumptions about the 
data generating process. Chapter 3 deals with issues related to 
the estimation of the optimal weighting matrix, while Chapter 4 
shows how hypothesis testing is performed in the GMM framework. 
Chapter 5 collects the main results available in the literature about 
the finite sample behavior of the GMM estimator for the most 
frequently used econometric models. Chapters 6 and 7 show how 
the GMM techniques can be applied to different time series models. 
Chapters 8 and 9, through the examples of panel data models, show 
how to derive different types of GMM estimators and also analyze 
many important practical problems like the choice of moment con- 
ditions (instruments) or the trade off between asymptotic efficiency 
and small sample performance. Chapter 10 takes the reader to the 
frontiers of this methodology, and finally, Chapter 11 illustrates the 
usefulness of the GMM approach through a theoretical application.


Chapter 2


GMM ESTIMATION TECHNIQUES 


Masao Ogaki 


In this chapter, GMM estimation techniques are discussed in the 
context of empirical applications. The first section presents GMM 
applications for stationary variables. The second section discusses 
applications in the presence of nonstationary variables. In the 
third section, general issues for GMM estimation techniques are
discussed.¹


2.1 GMM Estimation 


This section defines a GMM estimator in a framework which is ap- 
plicable to the applications covered in this chapter. This framework 
is simpler than the general framework presented in Chapter 1. 


2.1.1 GMM Estimator 


As in Chapter 1, GMM is based on moment conditions,

$$E(f(x_t, \theta_0)) = 0, \tag{2-1}$$

for a $q \times 1$ vector function $f(x_t, \theta_0)$, where $\theta_0$ is a $p$-dimensional
vector of the parameters to be estimated. We refer to $u_t = f(x_t, \theta_0)$
as the GMM disturbance. In this chapter, it is assumed that
$\{x_t : t = 1, 2, \ldots\}$ is a stationary and ergodic vector stochastic process
unless otherwise noted. Let $A_T$ be a sequence of positive semidefinite
matrices which converge to a non-stochastic positive definite matrix
$A$, $f_T(\theta) = T^{-1} \sum_{t=1}^{T} f(x_t, \theta)$, and $Q_T(\theta) = f_T(\theta)' A_T f_T(\theta)$.
The matrices $A_T$ and $A$ are referred to as distance (or weighting)
matrices. The basic idea of GMM estimation is to mimic the
moment conditions (2-1) by minimizing a quadratic form of the
sample means $Q_T(\theta)$, so that the GMM estimator is


$$\hat\theta_T = \arg\min_{\theta} Q_T(\theta).$$


¹ See Ogaki [1993a] for a nontechnical introduction to GMM. Ogaki [1993b]
describes the use of the Hansen/Heaton/Ogaki GAUSS GMM package.


As shown in Chapter 1, the GMM estimator is weakly consistent
and is asymptotically normally distributed. It should be noted
that strong distributional assumptions, such as normality of
$f(x_t, \theta_0)$, are not necessary.


When the number of the moment conditions (q) is equal to 
the number of parameters to be estimated (p), the system is just 
identified. In the case of a just identified system, the GMM esti- 
mator does not depend on the choice of distance matrix. When 
q > p, there exist overidentifying restrictions and different GMM 
estimators are obtained for different distance matrices. In this 
case, one may choose the distance matrix that results in an
asymptotically efficient GMM estimator. As shown in Chapter 1, an
asymptotically efficient estimator is obtained by choosing $A = V^{-1}$,
where


$$V = \lim_{j \to \infty} \sum_{\tau = -j}^{j} E(u_t u_{t-\tau}') \tag{2-2}$$


is called the long-run covariance matrix of the GMM disturbance
$u_t$. With this choice of the distance matrix, $\sqrt{T}(\hat\theta_T - \theta_0)$ has an
asymptotic normal distribution with mean zero and the covariance
matrix $\{F'V^{-1}F\}^{-1}$.


Let $V_T$ be a consistent estimator of $V$. Then $A_T = V_T^{-1}$ is
used to obtain $\hat\theta_T$. The resulting estimator is called the optimal
or efficient GMM estimator (see, also, Chapter 3). It should be
noted, however, that it is optimal given $f(x_t, \theta)$. In the context
of instrumental variable estimation, this means that instrumental
variables are given. The selection of instrumental variables will be
discussed in Section 4. Let $F_T$ be a consistent estimator of $F$. Then
the standard errors of the optimal GMM estimator $\hat\theta_T$ are calculated
as square roots of the diagonal elements of $T^{-1}\{F_T' V_T^{-1} F_T\}^{-1}$. The
appropriate method for estimating $V$ depends on the model. This
problem will be discussed for each application in this chapter. It
is usually easier to estimate $F$ by $F_T = (1/T) \sum_{t=1}^{T} (\partial f(x_t, \hat\theta_T)/\partial \theta')$
than to estimate $V$. In linear models or in some simple nonlinear
models, analytical derivatives are readily available. In nonlinear
models, numerical derivatives are often used.


Depending on how the distance matrix is estimated, there exist 
alternative GMM estimators. The two stage GMM estimator uses 
an identity matrix as the distance matrix in the first stage, then the 
first stage estimator is used to estimate V, which will be used for 
the second stage GMM estimator. The iterative GMM estimator
repeats this procedure until $\hat\theta_T$ converges or until the number of
iterations attains some large value. In some nonlinear models,
the iterations may be costly in terms of time. In such cases, it
is recommended that the third, fourth, or fifth stage GMM be
used because the gains from further iterations may be small. Monte
Carlo experiments of Kocherlakota [1990] suggest that the iterative 
GMM estimator performs better than the two stage estimator in 
small samples. 
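
The two stage and iterative procedures, together with the standard error formula $T^{-1}\{F_T' V_T^{-1} F_T\}^{-1}$, can be sketched for a linear model with $f(x_t, \theta) = z_t (y_t - x_t'\theta)$, where every stage has a closed form. The function name `iterated_linear_gmm`, the simulated design and the stopping rule are assumptions made for this illustration; $V$ is estimated as if the GMM disturbance were serially uncorrelated.

```python
import numpy as np

def iterated_linear_gmm(y, X, Z, max_stages=5, tol=1e-10):
    """Iterative GMM for f(x_t, theta) = z_t (y_t - x_t' theta)."""
    T = len(y)
    Szx, Szy = Z.T @ X / T, Z.T @ y / T
    A = np.eye(Z.shape[1])                      # first stage: identity distance matrix
    theta = None
    for stage in range(1, max_stages + 1):
        theta_new = np.linalg.solve(Szx.T @ A @ Szx, Szx.T @ A @ Szy)
        u = Z * (y - X @ theta_new)[:, None]    # GMM disturbances u_t
        V = u.T @ u / T                         # estimate of V (no serial correlation)
        done = theta is not None and np.max(np.abs(theta_new - theta)) < tol
        theta, A = theta_new, np.linalg.inv(V)
        if done:
            break
    F = -Szx                                    # derivative of the sample moments
    cov = np.linalg.inv(F.T @ np.linalg.inv(V) @ F) / T
    # The standard errors are appropriate from the second stage onwards.
    return theta, np.sqrt(np.diag(cov)), stage

rng = np.random.default_rng(3)
T = 2000
Z = rng.normal(size=(T, 3))
v = rng.normal(size=T)
x = Z @ np.array([1.0, 0.5, 0.5]) + v
e = (0.5 * v + rng.normal(size=T)) * np.sqrt(0.5 + Z[:, 0] ** 2)   # conditional heteroskedasticity
y = 2.0 * x + e

theta_hat, std_err, stages = iterated_linear_gmm(y, x[:, None], Z)
```

Setting `max_stages=2` gives the two stage estimator; larger values correspond to the third, fourth or fifth stage estimators mentioned above.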


For some applications, small sample properties of these conven- 
tional GMM estimators are reasonable as Tauchen [1986] shows. In 
other cases, small sample biases of GMM estimators are large as 
shown by Kocherlakota [1990]. More recently, many authors have 
shown that the conventional GMM estimators have serious small 
sample problems in certain applications (see, e.g., Altonji and Segal 
[1996], Burnside and Eichenbaum [1996], Christiano and den Haan 
[1996], and Ni [1997]). 


Partly in response to these studies, alternatives to the conven- 
tional GMM estimators have been proposed. Hansen, Heaton, and 
Yaron [1996] propose the continuous-updating estimator. The idea
is to minimize $f_T(\theta)' V_T(\theta)^{-1} f_T(\theta)$ by choice of $\theta$, allowing $V_T$
to vary with $\theta$. Their Monte Carlo experiments suggest that the
continuous-updating estimator typically has less median bias than
the two stage and iterative estimators, but its distribution has fatter
tails. Imbens [1997], Kitamura and Stutzer [1997], and Imbens,
Johnson, and Spady [1998] use information theoretic approaches to 
propose alternatives to the conventional GMM estimators. These 
new estimators seem to perform better than the conventional GMM 
in many cases. As is the case with any new estimators, however, 
much work remains in order to understand exactly when and which 
of these new alternative estimators should be used.


2.1.2 Important Assumptions


This section discusses two assumptions under which large sample 
properties of GMM estimators are derived. These two assumptions 
are important in the sense that applied researchers have encoun- 
tered cases where these assumptions are obviously violated unless 
special care is taken. 


In Hansen [1982], $X_t$ is assumed to be (strictly) stationary. A
time series $\{X_t : -\infty < t < \infty\}$ is stationary if the joint distribution
of $\{X_t, \ldots, X_{t+\tau}\}$ is identical to that of $\{X_{t+s}, \ldots, X_{t+s+\tau}\}$ for any
$\tau$ and $s$. Among other things, this implies that when they exist,
the unconditional moments $E(X_t)$ and $E(X_t X_{t+\tau}')$ cannot depend
on $t$ for any $\tau$. Thus this assumption rules out deterministic
trends, autoregressive unit roots, and unconditional heteroskedasticity.
On the other hand, conditional moments $E(X_{t+\tau} \mid I_t)$ and
$E(X_{t+\tau} X_{t+\tau+s}' \mid I_t)$ can depend on $I_t$. Thus the stationarity
assumption does not rule out the possibility that $X_t$ has conditional
heteroskedasticity. It should be noted that it is not enough for
$u_t = f(X_t, \beta_0)$ to be stationary. It is required that $X_t$ is stationary,
so that $f(X_t, \beta)$ is stationary for all admissible $\beta$, not just for $\beta = \beta_0$
(see Section 8.1.4 for an example in which $f(X_t, \beta_0)$ is stationary
but $f(X_t, \beta)$ is not for other values of $\beta$).


Gallant [1987] and Gallant and White [1988] show that the 
GMM strict stationarity assumption can be relaxed to allow for 
unconditional heteroskedasticity. This does not mean that $X_t$ can
exhibit nonstationarity by having deterministic trends, autoregres- 
sive unit roots, or an integrated GARCH representation. Some of 
their regularity conditions are violated by these popular forms of 
nonstationarity. Recent papers by Andrews and McDermott [1995] 
and Dwyer [1995] show that the stationarity assumption can be 
further relaxed for some forms of nonstationarity. However, the 
long-run covariance matrix estimation procedure often needs to be 
modified to apply their asymptotic theory. For this reason, the 
strict stationarity assumption is emphasized in the context of time 
series applications rather than the fact that this assumption can be 
relaxed. 


Another important assumption is Assumption 1.1 in Chapter 1
for identification. This assumption is obviously violated if
$f(x_t, \theta) = 0$ for some $\theta$ which does not have any economic meaning.


2.2 Nonlinear Instrumental Variable
Estimation


GMM is often used in the context of nonlinear instrumental variable 
(NLIV) estimation. This section describes NLIV estimation as a 
special case of GMM estimation, and explains how the long-run 
covariance matrix of the GMM disturbance, V, depends on the 
information structure of the model. 


Let $h(w_t, \theta)$ be a $k$-dimensional vector of functions and
$e_t = h(w_t, \theta_0)$. Suppose that there exist conditional moment restrictions,
$E[e_t \mid I_t] = 0$. Here it is assumed that $I_t \subset I_{t+1}$ for any $t$. Let $z_t$ be a
$q \times k$ matrix of random variables that are in the information set $I_t$.²
Then by the law of iterated expectations, we obtain unconditional
moment conditions:


$$E[z_t h(w_t, \theta_0)] = 0.$$


Thus we let $x_t = (w_t', z_t')'$ and $f(x_t, \theta) = z_t h(w_t, \theta)$ in this case.
Hansen [1982] points out that the NLIV estimators discussed by
Amemiya [1974], Jorgenson and Laffont [1974], and Gallant [1977]
can be interpreted as optimal GMM estimators when $e_t$ is serially
uncorrelated and conditionally homoskedastic.


The form of the long-run covariance matrix of the GMM distur- 
bance depends on the information structure of the economic model 
as in the following three cases. An example of each of these cases 
will be given in the next section. 


The simplest case is when $e_t$ is in $I_{t+1}$. Then $u_t$ is serially uncorrelated
because $E(u_t u_{t+j}') = E(E(u_t u_{t+j}' \mid I_{t+1})) = E(u_t E(u_{t+j}' \mid I_{t+1})) = 0$
for $j \geq 1$. Therefore, $V = E(u_t u_t')$, which can be
estimated by $(1/(T - p)) \sum_{t=1}^{T} f(x_t, \hat\theta_T) f(x_t, \hat\theta_T)'$.


The second case is when $e_t$ is in the information set $I_{t+s+1}$,
where $s \geq 0$. Then $E(u_t u_{t+\tau}') = E(E(u_t u_{t+\tau}' \mid I_{t+s+1})) =
E(u_t E(u_{t+\tau}' \mid I_{t+s+1})) = 0$ for $\tau \geq s + 1$. Thus the order of
serial correlation of $u_t$ is $s$, and $u_t$ is a moving average of order
$s$. In this case, $V = \sum_{\tau = -s}^{s} E(u_t u_{t-\tau}')$, whose natural estimator
is a truncated kernel estimator (see, e.g., Ogaki [1993a]). If the
truncated kernel estimator results in a non-positive semidefinite
matrix, another method must be used.


² In some applications, $z_t$ is a function of $\theta$. This does not cause any problem
as long as the resulting $f(x_t, \theta)$ can be written as a function of $\theta$ and a
stationary random vector $x_t$.
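
A truncated kernel estimator of $V$ for a known order $s$ can be sketched as below. The function names are hypothetical, the sample autocovariances are divided by $T$ as one common convention, and the moving average example is simulated for illustration only.

```python
import numpy as np

def truncated_kernel_lrv(u, s):
    """V_T = sum_{tau=-s}^{s} Gamma_tau, Gamma_tau = T^{-1} sum_t u_t u_{t-tau}'."""
    u = np.asarray(u, dtype=float)
    T = u.shape[0]
    V = u.T @ u / T                             # Gamma_0
    for tau in range(1, s + 1):
        gamma = u[tau:].T @ u[:-tau] / T        # Gamma_tau
        V = V + gamma + gamma.T                 # adds Gamma_tau and Gamma_{-tau}
    return V

def is_positive_semidefinite(V, tol=1e-10):
    """Check the smallest eigenvalue of the symmetrised matrix."""
    return bool(np.linalg.eigvalsh((V + V.T) / 2.0).min() >= -tol)

# MA(1) disturbance u_t = eps_t + 0.5 eps_{t-1}: long-run variance (1 + 0.5)^2 = 2.25.
rng = np.random.default_rng(4)
eps = rng.normal(size=20001)
u_ma1 = (eps[1:] + 0.5 * eps[:-1])[:, None]
V_ma1 = truncated_kernel_lrv(u_ma1, s=1)

# An alternating series shows how the estimate can fail to be positive semidefinite.
u_alt = np.array([(-1.0) ** t for t in range(10)])[:, None]
V_alt = truncated_kernel_lrv(u_alt, s=1)        # 1 - 2 * 0.9 = -0.8
```

When `is_positive_semidefinite` returns `False`, as for `V_alt`, the estimate cannot be inverted into a valid distance matrix.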


The third case is when $e_t$ is not in the information set $I_{t+s+1}$ for
any $s \geq 0$. Unless some other restrictions are imposed, the serial
correlation structure of u; is not known. The long-run covariance 
matrix of the GMM disturbance is given by (2-2) in this case, 
and an estimator of V should allow an unknown order of serial 
correlation. 


2.3 GMM Applications with 
Stationary Variables 


This section illustrates GMM estimation techniques for models with 
stationary variables by discussing examples of empirical applica- 
tions. Problems that researchers have encountered in applying 
GMM and procedures they have used to address these problems 
are discussed. 


2.3.1 Euler Equation Approach 


Hansen and Singleton [1982] show how to apply GMM to a
Consumption-Based Capital Asset Pricing Model (C-CAPM). Consider
an economy in which a representative agent maximizes

$$\sum_{t=1}^{\infty} \beta^t E(U(C_t) \mid I_0) \tag{2-3}$$

subject to a budget constraint. Hansen and Singleton [1982] use
the isoelastic intraperiod utility function


$$U(C_t) = \frac{1}{1 - \alpha} (C_t^{1-\alpha} - 1), \tag{2-4}$$


where $C_t$ is real consumption at date $t$ and $\alpha > 0$ is the reciprocal
of the intertemporal elasticity of substitution ($\alpha$ is also the relative
risk aversion coefficient for consumption in this model). The
standard Euler equation for the optimization problem is


$$\frac{E[\beta C_{t+1}^{-\alpha} R_{t+1} \mid I_t]}{C_t^{-\alpha}} = 1, \tag{2-5}$$


where $R_{t+1}$ is the gross real return of any asset. The observed $C_t$
they use is obviously nonstationary, although the specific form of
nonstationarity is not clear (difference stationary or trend stationary,
for example). Hansen and Singleton use $C_{t+1}/C_t$ in their econometric
formulation, which is assumed to be stationary. Then let
$\theta = (\beta, \alpha)'$, $w_t = (C_{t+1}/C_t, R_{t+1})'$ and $h(w_t, \theta) = \beta (C_{t+1}/C_t)^{-\alpha} R_{t+1} - 1$
in the notation for the NLIV model in Section 2.2 above. Any
stationary variables in $I_t$ can be used for instrumental variables $z_t$,
and Hansen and Singleton [1982] use the lagged values of $w_t$.


In this case, $u_t$ is in $I_{t+1}$ because $C_{t+1}$ and $R_{t+1}$ are observed
in period $t + 1$, and hence $u_t$ is serially uncorrelated. Hansen and
Singleton [1984] find that the Chi-square test for the overidentifying 
restrictions rejects their model especially when nominal risk free 
bond returns and stock returns are used simultaneously. Their find- 
ing is consistent with Mehra and Prescott’s [1985] equity premium 
puzzle. When the model is rejected, the Chi-square test statistic 
does not provide much guidance as to what causes the rejection. 
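
The moment function for this application can be sketched as follows. The function names, the artificial data and the particular instruments (a constant and one lag of each element of $w_t$) are assumptions for illustration. The return series is constructed so that $h(w_t, \theta_0) = 0$ holds exactly, which makes the sample moments vanish at the true parameter.

```python
import numpy as np

def euler_moments(theta, growth, gross_return, instruments):
    """Rows are f(x_t, theta) = z_t * h(w_t, theta), with
    h = beta * (C_{t+1}/C_t)^(-alpha) * R_{t+1} - 1."""
    beta, alpha = theta
    h = beta * growth ** (-alpha) * gross_return - 1.0   # length-T disturbance
    return instruments * h[:, None]                      # T x q

def sample_moment(theta, growth, gross_return, instruments):
    """f_T(theta): the sample mean of the moment function."""
    return euler_moments(theta, growth, gross_return, instruments).mean(axis=0)

# Artificial data: consumption growth and a return that satisfies h = 0 at theta0.
T = 50
t = np.arange(T)
growth = 1.02 + 0.01 * np.sin(t)                 # C_{t+1}/C_t
beta0, alpha0 = 0.97, 2.0
gross_return = growth ** alpha0 / beta0          # R_{t+1}

# Instruments in I_t: a constant and the lagged values of w_t.
z = np.column_stack([np.ones(T - 1), growth[:-1], gross_return[:-1]])
f_true = sample_moment((beta0, alpha0), growth[1:], gross_return[1:], z)
f_wrong = sample_moment((0.90, alpha0), growth[1:], gross_return[1:], z)
```

With three instruments and two parameters there is one overidentifying restriction, which is what the Chi-square test mentioned above examines.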


2.3.2 Habit Formation 


Many researchers have considered effects of time—nonseparability 
in preferences on asset pricing. Replace (2-4) by


$$U(C_t, C_{t-1}, C_{t-2}, \ldots) = \frac{1}{1 - \alpha} (S_t^{1-\alpha} - 1), \tag{2-6}$$


where $S_t$ is service flow from consumption purchases. Purchases of
consumption and service flows are related by


$$S_t = a_0 C_t + a_1 C_{t-1} + a_2 C_{t-2} + \ldots. \tag{2-7}$$


Depending on the values of the $a_\tau$'s, this model leads to a model
with habit formation and/or durability. Constantinides [1990] argues
that habit formation could help solve the equity premium
puzzle. He shows how the intertemporal elasticity of substitution
and the relative risk aversion coefficient depend on the $a_\tau$ and $\alpha$
parameters in a habit formation model.


In this section, I will discuss applications by Ferson and Con- 
stantinides [1991], Cooley and Ogaki [1996], and Ogaki and Park 
[1998] to illustrate econometric formulations. In their models, it is
assumed that $a_\tau = 0$ for $\tau \geq 2$ and $a_0$ is normalized to be one, so
that $\theta = (\beta, \alpha, a_1)'$. The asset pricing equation takes the form


$$\frac{E[\beta (S_{t+1}^{-\alpha} + \beta a_1 S_{t+2}^{-\alpha}) R_{t+1} \mid I_t]}{E[S_t^{-\alpha} + \beta a_1 S_{t+1}^{-\alpha} \mid I_t]} = 1. \tag{2-8}$$


Then let $e_t^0 = \beta (S_{t+1}^{-\alpha} + \beta a_1 S_{t+2}^{-\alpha}) R_{t+1} - (S_t^{-\alpha} + \beta a_1 S_{t+1}^{-\alpha})$. Though
Euler equation (2-8) implies that $E(e_t^0 \mid I_t) = 0$, this cannot be
used as the disturbance for GMM because both of the two regularity
assumptions discussed in Section 2.1.2 of the present chapter are
violated. These violations are caused by the nonstationarity of $C_t$
and by the three sets of trivial solutions, $\alpha = 0$ and $1 + \beta a_1 = 0$;
$\beta = 0$ and $\alpha = \infty$; and $\beta = 0$ and $a_1 = \infty$ with $\alpha$ positive. Ferson
and Constantinides [1991] solve both of these problems by defining
$e_t = e_t^0 / \{S_t^{-\alpha} (1 + \beta a_1)\}$. Since $S_t^{-\alpha}$ is in $I_t$, $E(e_t \mid I_t) = 0$. The disturbance is
a function of $S_{t+\tau}/S_t$ ($\tau = 1, 2$) and $R_{t+1}$. When $C_{t+1}/C_t$ and $R_t$
are assumed to be stationary, $S_{t+\tau}/S_t$ and the disturbance can be
written as a function of stationary variables.


One problem that researchers have encountered in these applications
is that $C_{t+1} + a_1 C_t$ may be negative when $a_1$ is close to
minus one. In a nonlinear search for $\hat\theta_T$ or in calculating numerical
derivatives, a GMM computer program will stall if it tries a value
of $a_1$ that makes $C_{t+1} + a_1 C_t$ negative for any $t$. Atkeson and
Ogaki [1991] have encountered similar problems in estimating fixed
subsistence levels from panel data. One way to avoid this problem
is to program the function $f(x_t, \theta)$ so that the program returns
very large numbers as the values of $f(x_t, \theta)$ when non-admissible
parameter values are used. However, it is necessary to ignore these
large values of $f(x_t, \theta)$ when calculating numerical derivatives.
This can be done by suitably modifying programs that calculate
numerical derivatives.³
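
One possible way to implement both ideas is sketched below. It is not the modification used in any particular GMM package: the penalty value `BIG`, the wrapper `guarded`, the toy moment function and the fallback to a one-sided difference are all assumptions made for illustration.

```python
import numpy as np

BIG = 1.0e10  # illustrative penalty value (an assumption, not from the text)

def service_flow_ratio(a1, c):
    """Return S_{t+1}/S_t for S_t = C_t + a1*C_{t-1}, or None if some S_t <= 0."""
    s = c[1:] + a1 * c[:-1]
    if np.any(s <= 0.0):
        return None
    return s[1:] / s[:-1]

def guarded(moment_fn, c, q):
    """Wrap a moment function so that inadmissible a1 values return large numbers."""
    def fun(theta):
        ratio = service_flow_ratio(theta[-1], c)
        if ratio is None:
            return np.full(q, BIG)
        return np.asarray(moment_fn(theta, ratio), dtype=float)
    return fun

def numerical_derivative(fun, theta, j, h=1e-6):
    """Difference quotient in theta[j] that ignores penalised evaluations."""
    theta = np.asarray(theta, dtype=float)
    step = np.zeros_like(theta)
    step[j] = h
    f0, f_plus, f_minus = fun(theta), fun(theta + step), fun(theta - step)
    if not np.all(f0 < BIG):
        raise ValueError("theta itself is not admissible")
    ok_plus, ok_minus = np.all(f_plus < BIG), np.all(f_minus < BIG)
    if ok_plus and ok_minus:
        return (f_plus - f_minus) / (2.0 * h)    # central difference
    if ok_plus:
        return (f_plus - f0) / h                 # one-sided, admissible side only
    if ok_minus:
        return (f0 - f_minus) / h
    raise ValueError("no admissible neighbouring point")

# Toy example with theta = (alpha, a1) and a made-up consumption series.
c = np.array([1.0, 1.2, 1.1, 1.3])
toy = lambda theta, ratio: np.array([np.mean(ratio ** (-theta[0])) - 1.0])
fun = guarded(toy, c, q=1)

penalised = fun(np.array([2.0, -1.0]))                            # some S_t <= 0
d_interior = numerical_derivative(fun, [2.0, -0.5], j=1)
d_edge = numerical_derivative(fun, [2.0, -0.916], j=1, h=1e-3)    # minus side inadmissible
```

Near the boundary of the admissible region the sketch falls back to a one-sided difference, so the penalty value never enters the derivative.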


Let $z_t$ be a vector of instrumental variables, which are in $I_t$.
Because $e_t$ and the GMM disturbance $u_t = z_t e_t$ are in $I_{t+2}$ in this
model, $u_t$ has a moving average of order 1 structure. Hence,
$V = \sum_{\tau = -1}^{1} E(u_t u_{t-\tau}')$ in this case.


³ Ogaki [1998] explains these modifications for the Hansen/Heaton/Ogaki GMM
package.


2.3.3 Linear Rational Expectations Models


Hansen and Sargent [1980] develop a method to apply GMM to
linear rational expectations models with stationary variables that
imposes the restrictions implied by linear rational expectations
models. Section 2.4 discusses how to modify Hansen and Sargent's
method in the presence of nonstationary variables.


Many linear rational expectations models have an implication 
that an economic variable depends on a geometrically declining 
weighted sum of expected future values of another variable


$$Y_t = a E\left( \sum_{i=1}^{\infty} b^i X_{t+i} \,\Big|\, I_t \right) + c' Z_t, \tag{2-9}$$

where $a$ and $b$ are constants, $c$ is a vector of constants, $Y_t$ and $X_t$
are random variables, and $Z_t$ is a random vector. This implication
imposes nonlinear restrictions on the VAR representation of $Y_t$, $X_t$,
and $Z_t$ as shown by Hansen and Sargent [1980].


Consider West's [1987] model as an example of a linear rational
expectations model. Let $p_t$ be the real stock price (after the dividend
is paid) in period $t$ and $d_t$ be the real dividend paid to the
owner of the stock at the beginning of period $t$. Then the arbitrage
condition is

$$p_t = E[b(p_{t+1} + d_{t+1}) \mid I_t], \tag{2-10}$$


where $b$ is the constant real discount rate and $I_t$ is the information
set available to economic agents in period $t$. Solving (2-10) forward
and imposing the no bubble condition, we obtain the present value
formula:


$$p_t = E\left( \sum_{i=1}^{\infty} b^i d_{t+i} \,\Big|\, I_t \right). \tag{2-11}$$
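
The forward solution can be spelled out in one step. The following is a sketch that uses only the arbitrage condition (2-10) and the law of iterated expectations, with $n$ an arbitrary horizon introduced for the derivation:

```latex
\begin{aligned}
p_t &= E\Big( \sum_{i=1}^{n} b^i d_{t+i} \,\Big|\, I_t \Big) + b^n E(p_{t+n} \mid I_t)
      \qquad \text{for every } n \geq 1, \\
\lim_{n \to \infty} b^n E(p_{t+n} \mid I_t) &= 0
      \qquad \text{(no bubble condition)}.
\end{aligned}
```

Letting $n \to \infty$ in the first line and using the second gives the present value formula (2-11).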


We now derive restrictions for p; and d; implied by (2-11). Many 
linear rational expectations models imply that a variable is the 
expectation of a discounted infinite sum conditional on an informa- 
tion set. Hence similar restrictions can be derived for these rational 
expectations models. 


Assume that $d_t$ is covariance stationary with mean zero (imagine
that data are demeaned), so that it has a Wold moving average
representation

$$d_t = \alpha(L) \nu_t,$$

where $\alpha(L) = 1 + \alpha_1 L + \alpha_2 L^2 + \ldots$ and where

$$\nu_t = d_t - \mathrm{Proj}(d_t \mid H_{t-1}).$$

Here, $\mathrm{Proj}(\cdot \mid H_t)$ is the linear projection operator onto the information
set $H_t = \{d_t, d_{t-1}, d_{t-2}, \ldots\}$. We assume that the econometrician
uses the information set $H_t$, which may be much smaller
than the economic agents' information set, $I_t$. Assuming that $\alpha(L)$
is invertible,


$$\phi(L) d_t = \nu_t,$$

where $\phi(L) = 1 - \phi_1 L - \phi_2 L^2 - \ldots$.


Using (2-11) and the law of iterated projections, we obtain


$$p_t = \mathrm{Proj}\left( \sum_{i=1}^{\infty} b^i d_{t+i} \,\Big|\, H_t \right) + w_t, \tag{2-12}$$


where

$$w_t = E\left( \sum_{i=1}^{\infty} b^i d_{t+i} \,\Big|\, I_t \right) - \mathrm{Proj}\left( \sum_{i=1}^{\infty} b^i d_{t+i} \,\Big|\, H_t \right),$$

and $\mathrm{Proj}(w_t \mid H_t) = 0$. Because $\mathrm{Proj}(\cdot \mid H_t)$ is the linear projection
operator onto $H_t$,


$$\mathrm{Proj}\left( \sum_{i=1}^{\infty} b^i d_{t+i} \,\Big|\, H_t \right) = \delta(L) d_t,$$

where $\delta(L) = \delta_1 + \delta_2 L + \ldots$. Following Hansen and Sargent [1980,
Appendix A], we obtain the restrictions imposed by (2-11) on $\delta(L)$
and $\phi(L)$:


$$\delta(L) = \frac{bL^{-1}(\alpha(L) - \alpha(b))}{1 - bL^{-1}} \phi(L) = \frac{bL^{-1}(1 - \phi^{-1}(b)\phi(L))}{1 - bL^{-1}}. \tag{2-13}$$

We now parameterize $\phi(L)$ as a $q$-th order polynomial:

$$d_{t+1} = \phi_1 d_t + \ldots + \phi_q d_{t-q+1} + \nu_{t+1}.$$


Then (2-13) can be used to show that $\delta(L)$ is a finite order polynomial
and to give an explicit formula for the coefficients of $\delta(L)$ (see
West [1987] for the formula, which is based on Hansen and Sargent
[1980], and West [1988] for deterministic terms when $d_t$ has
a nonzero mean). Thus


$$p_t = \delta_1 d_t + \ldots + \delta_q d_{t-q+1} + w_t,$$

where the $\delta_i$'s are functions of $b$ and the $\phi_i$'s:


$$\delta_j = \phi(b)^{-1} \sum_{k=j}^{q} b^{k-j+1} \phi_k \quad \text{for } j = 1, \ldots, q, \tag{2-14}$$

with $\phi(b) = 1 - \phi_1 b - \ldots - \phi_q b^q$.
These are the nonlinear restrictions which Equation (2-11) implies.
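
Under the definitions above, with $\phi(b) = 1 - \phi_1 b - \ldots - \phi_q b^q$, the coefficients can be written as $\delta_j = \phi(b)^{-1} \sum_{k=j}^{q} b^{k-j+1} \phi_k$. The sketch below computes them and checks the result against an independent calculation based on the companion form of the autoregression. Both function names and the parameter values are illustrative assumptions, and the check presumes that the present value converges.

```python
import numpy as np

def present_value_coeffs(phi, b):
    """delta_j = phi(b)^{-1} * sum_{k=j}^{q} b^(k-j+1) * phi_k, for j = 1..q."""
    phi = np.asarray(phi, dtype=float)
    q = phi.size
    phi_b = 1.0 - sum(phi[k - 1] * b ** k for k in range(1, q + 1))
    return np.array([
        sum(b ** (k - j + 1) * phi[k - 1] for k in range(j, q + 1)) / phi_b
        for j in range(1, q + 1)
    ])

def present_value_coeffs_companion(phi, b):
    """Same coefficients from sum_{i>=1} (bA)^i = bA (I - bA)^{-1}, where A is
    the companion matrix of d_{t+1} = phi_1 d_t + ... + phi_q d_{t-q+1} + nu_{t+1}."""
    phi = np.asarray(phi, dtype=float)
    q = phi.size
    A = np.zeros((q, q))
    A[0, :] = phi
    A[1:, :-1] = np.eye(q - 1)
    M = b * A @ np.linalg.inv(np.eye(q) - b * A)
    return M[0, :]                      # loadings on (d_t, ..., d_{t-q+1})

b = 0.95
delta_ar2 = present_value_coeffs([0.5, 0.2], b)
delta_ar2_check = present_value_coeffs_companion([0.5, 0.2], b)
delta_ar1 = present_value_coeffs([0.6], b)      # AR(1): b*phi_1 / (1 - b*phi_1)
```

Because the $\delta_j$ are smooth functions of $b$ and the $\phi_k$, a function of this kind can be called inside the moment function to impose the restrictions during estimation.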


Let $z_{1t}$ be a vector of random variables in $H_t$. For example,
$z_{1t} = (d_t, \ldots, d_{t-q+1})'$. The unknown parameters $b$ and $\phi_i$'s
can be estimated by applying the GMM to moment conditions
$E(z_{1t} \nu_{t+1}) = 0$ and $E(z_{1t} w_{t+1}) = 0$.


Let $z_{2t}$ be a random variable in $I_t$, say $d_t$, and


$$p_t = b(p_{t+1} + d_{t+1}) + u_{t+1}.$$


Then (2-10) implies another set of moment conditions $E(z_{2t} u_{t+1}) = 0$,
which can be used to estimate $b$.


Hansen and Sargent's method is to use three sets of the moment
conditions, $E(z_{1t} \nu_{t+1}) = 0$, $E(z_{1t} w_{t+1}) = 0$, and $E(z_{2t} u_{t+1}) = 0$,
simultaneously to form a GMM estimator for the unknown parameters
$b$ and $\phi_i$'s with the restrictions (2-14) imposed.


Because $u_t$ is in $I_{t+1}$ and $\nu_t$ is in $H_{t+1}$, $u_t$ and $\nu_t$ are serially
uncorrelated. It should be noted, however, that $w_t$ is not necessarily
in $H_{t+1}$. Hence $w_t$ has an unknown order of serial correlation.
The estimation of the long-run covariance matrix of the GMM
disturbance must allow for an unknown order of serial correlation.


2.4 GMM in the Presence of 
Nonstationary Variables 


This section illustrates applications of GMM to structural economic
models in the presence of nonstationary variables. First, Cooley
and Ogaki’s [1996] cointegration-Euler equation approach is described.
Then applications of GMM to structural error correction
models (ECMs) are explained. When some of the random variables
in the system are unit root nonstationary and cointegrated,
ECMs are widely used. As Engle and Granger [1987] show, an ECM
representation exists when the variables are cointegrated, and vice
versa. The standard ECMs are reduced form models, just as VAR models
are, as pointed out by Urbain [1992] and Boswijk [1994], [1995].


2.4.1 The Cointegration—Euler Equation Approach 


This section explains Cooley and Ogaki’s [1996] cointegration-Euler
equation approach, which combines Ogaki and Park’s [1998]
cointegration approach to estimating preference parameters with
Hansen and Singleton’s [1982] Euler equation approach based on
GMM. In the first step of this approach, a cointegrating regression is
applied to an intratemporal first order condition for the household’s
maximization problem to estimate some preference parameters.
In the second step, GMM is applied to an Euler equation after
plugging in the point estimates from the cointegrating regression in
the first step. Because the first step estimators are superconsistent,
the asymptotic properties of the GMM estimators in the second step
are not affected by the first step estimation.


Cooley and Ogaki’s method is applicable when the system 
consists of cointegrated nonstationary variables and stationary vari- 
ables with nonlinear moment conditions. Kitamura and Phillips 
[1997] provide a general asymptotic theory for linear GMM with 
both stationary and nonstationary variables. The results in Park 
and Phillips [1997] should make it possible to develop a general as- 
ymptotic theory for nonlinear GMM with nonstationary variables. 


This section explains Cooley and Ogaki’s application of the
approach to a consumption-leisure choice model with time nonseparable
preferences that are additively separable in consumption and
leisure. Ogaki, Ostry, and Reinhart [1996] and Ogaki and Reinhart
[1998] apply the approach to estimate the intertemporal elasticity
of substitution.


Consider a simple intraperiod utility function that is assumed
to be time- and state-separable and separable in nondurable
consumption, durable consumption, and leisure

$$ u(t) = \frac{C(t)^{1-\alpha} - 1}{1-\alpha} + v(l(t)), $$

where v(·) is a continuously differentiable concave function, C(t)
is nondurable consumption, and l(t) is leisure. Time nonseparable
preferences can be treated relatively easily with the cointegration
approach as in Cooley and Ogaki [1996].


The usual first order condition for a household that equates
the real wage rate with the marginal rate of substitution between
leisure and consumption is

$$ W(t) = \frac{v'(l(t))}{C(t)^{-\alpha}}, $$

where W(t) is the real wage rate. We assume that the stochastic
process of leisure is (strictly) stationary in the equilibrium as in
Eichenbaum, Hansen, and Singleton and that the random variables
used to form the conditional expectations for stationary variables
are stationary. Then an implication of the first order condition
is that ln(W(t)) − α ln(C(t)) = ln(v'(l(t))) is stationary. When
we assume that the log of consumption is difference stationary,
this implies that the log of the real wage rate and the log of
consumption are cointegrated with a cointegrating vector (1, −α). The
cointegration-Euler equation approach exploits this cointegration
restriction to identify the curvature parameter α from cointegrating
regressions.


In the first step, a cointegrating regression is used to estimate
α from the stationarity restriction. Because the log real wage rate
and the log consumption are cointegrated, either variable can be
used as the regressand. In finite samples, the empirical results will
differ depending on the choice of the regressand. However, the
results should be approximately the same as long as cointegration
holds and the sample size is large enough.


In the second step, the GMM procedure is applied to the asset
pricing equation after the estimate of α from the cointegrating
regression is plugged into the asset pricing equation. This two
step procedure does not alter the asymptotic distribution of the GMM
estimators and test statistics because the cointegrating regression
estimator is superconsistent and converges at a rate faster than
T^{1/2}.
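
The first step can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the data generating process, the parameter values, and the function names are hypothetical, and plain OLS stands in for whichever cointegrating regression estimator is used in practice. It regresses the log real wage on log consumption and, in reverse, log consumption on the log real wage, and recovers α from each.

```python
import random


def ols_slope(x, y):
    """Slope from an OLS regression of y on a constant and x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def simulate(alpha, T, seed=0):
    """Hypothetical DGP: ln C is a random walk with drift, and
    ln W - alpha * ln C is a stationary AR(1) process."""
    rng = random.Random(seed)
    ln_c, ln_w = [], []
    c, s = 0.0, 0.0
    for _ in range(T):
        c += 0.005 + 0.01 * rng.gauss(0.0, 1.0)  # difference stationary log consumption
        s = 0.5 * s + 0.02 * rng.gauss(0.0, 1.0)  # stationary term, plays the role of ln v'(l(t))
        ln_c.append(c)
        ln_w.append(alpha * c + s)
    return ln_c, ln_w


ALPHA = 2.0
ln_c, ln_w = simulate(ALPHA, T=5000)
alpha_w_on_c = ols_slope(ln_c, ln_w)         # regressand: log real wage
alpha_c_on_w = 1.0 / ols_slope(ln_w, ln_c)   # regressand: log consumption
print(alpha_w_on_c, alpha_c_on_w)
```

The two normalizations give slightly different numbers in a finite sample, but both should be close to the assumed α when the sample is large.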
2.4.2 Structural Error Correction Models


In this section, we discuss the relationship between structural mod- 
els and Error Correction Models (ECMs). Following Ogaki and 
Yang [1997], we adopt a slightly more general definition of struc- 
tural ECMs than Urbain [1992] and Boswijk [1994], [1995]. Let 
Y(t) be an n-dimensional vector of first difference stationary ran- 
dom variables. We assume that there exist p linearly independent 
cointegrating vectors, so that A'Y (t) is stationary, where A’ is a 
(p x n) matrix of real numbers whose rows are linearly independent 
cointegrating vectors. Consider a standard ECM 


$$ \Delta Y(t+1) = k + G A'Y(t) + F_1 \Delta Y(t) + F_2 \Delta Y(t-1) + \dots + F_p \Delta Y(t-p+1) + v(t+1), \tag{2-15} $$

where k is an (n × 1) vector, G is an (n × p) matrix of real numbers, and
v(t) is a stationary n-dimensional vector of random variables with
Proj(v(t+1) | H_{t−τ}) = 0. In many applications τ = 0, but we will
give examples of applications in which τ > 0.⁴ There exist many
ways to estimate (2-15). For example, Engle and Granger’s two
step method or Johansen’s Maximum Likelihood methods can be
used.


In many applications of standard ECMs, elements in G are 
given structural interpretations as parameters of the speed of ad- 
justment toward the long-run equilibrium represented by A'Y (t). 
It is of interest to study conditions under which the elements in 
G can be given such a structural interpretation. In the model of 
the previous section, the domestic price level gradually adjusts to 
its PPP level with a speed of adjustment b. We will investigate 
conditions under which b can be estimated as an element of G in
(2-15).

The standard ECM, (2-15), is a reduced form model. A class
of structural models can be written in the following form of a
structural ECM:

$$ C_0 \Delta Y(t+1) = d + B A'Y(t) + C_1 \Delta Y(t) + C_2 \Delta Y(t-1) + \dots + C_p \Delta Y(t-p+1) + u(t+1), \tag{2-16} $$


4 We will treat more general cases in which the expectation of v(t + 1) con- 
ditional on the economic agents’ information is not zero, but the linear 
projection of v(t + 1) onto an econometrician’s information set (which is 
smaller than the economic agents’ information set) is zero. 
where C_i is an (n × n) matrix, d is an (n × 1) vector, and B is an (n × p)
matrix of real numbers. Here C_0 is a nonsingular matrix of real
numbers with ones along its principal diagonal, and u(t) is a stationary
n-dimensional vector of random variables with E[u(t+1) | H_{t−τ}] =
0. Even though cointegrating vectors are not unique, we assume
that there is a normalization that uniquely determines A, so that
parameters in B have structural meanings.


In order to see the relationship between the standard ECM and
the structural ECM, we premultiply both sides of (2-16) by C_0^{-1} to
obtain the standard ECM (2-15), where k = C_0^{-1}d, G = C_0^{-1}B,
F_i = C_0^{-1}C_i, and v(t) = C_0^{-1}u(t). Thus the standard ECM
estimated by Engle and Granger’s two step method or Johansen’s
[1988] Maximum Likelihood methods is a reduced form model.
Hence it cannot be used to recover structural parameters in B, nor
can the impulse-response functions based on v(t) be interpreted in
a structural way unless some restrictions are imposed on C_0.


As in a VAR, various restrictions are possible for C_0. One
example is to assume that C_0 is lower triangular. If C_0 is lower
triangular, then the first row of G is equal to the first row of B,
and structural parameters in the first row of B are estimated by
the standard methods to estimate an ECM.
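
The lower triangular case can be checked numerically. The following is a minimal sketch with made-up numbers (a hypothetical system with three variables and one cointegrating vector); it computes G = C_0^{-1}B by forward substitution and shows that the first row of G coincides with the first row of B while the other rows do not.

```python
def solve_unit_lower_triangular(c0, b):
    """Solve c0 @ g = b by forward substitution, assuming c0 is lower
    triangular with ones on its diagonal, so that g = c0^{-1} b."""
    n = len(c0)
    cols = len(b[0])
    g = [[0.0] * cols for _ in range(n)]
    for i in range(n):
        for k in range(cols):
            g[i][k] = b[i][k] - sum(c0[i][j] * g[j][k] for j in range(i))
    return g


# Hypothetical values, chosen only for illustration.
C0 = [[1.0, 0.0, 0.0],
      [0.4, 1.0, 0.0],
      [-0.3, 0.7, 1.0]]
B = [[-0.25], [0.10], [0.00]]

G = solve_unit_lower_triangular(C0, B)
print(G)  # first row equals the first row of B; the others are mixtures
```

Only the first equation escapes contamination from contemporaneous interactions, which is why the triangular restriction identifies the first row of B and nothing more.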


2.4.3 An Exchange Rate Model with Sticky Prices 


This section presents a simple exchange rate model in which the
domestic price adjusts slowly toward the long-run equilibrium level
implied by Purchasing Power Parity (PPP), which is used by Ogaki
and Yang [1997] to motivate a particular form of a structural ECM
in the next section. This model’s two main components are a slow
adjustment equation and a rational expectations equation for the
exchange rate. The single equation method proposed in Section
2.4.4 is based only on the slow adjustment equation. The system
method utilizes both the slow adjustment and rational expectations
equations.


Let p(t) be the log domestic price level, p*(t) be the log foreign
price level, and e(t) be the log nominal exchange rate (the
price of one unit of the foreign currency in terms of the domestic
currency). We assume that these variables are first difference
stationary. We also assume that PPP holds in the long run, so
that the real exchange rate, p(t) − p*(t) − e(t), is stationary, or
y(t) = (p(t), e(t), p*(t))' is cointegrated with a cointegrating vector
(1, −1, −1)'. Let μ = E(p(t) − p*(t) − e(t)); then μ can be nonzero
when different units are used to measure prices in the two countries.


Using a one-good version of Mussa’s [1982] model, the domestic
price level is assumed to adjust slowly to the PPP level

$$ \Delta p(t+1) = b[\mu + p^*(t) + e(t) - p(t)] + E_t[p^*(t+1) + e(t+1)] - [p^*(t) + e(t)], \tag{2-17} $$

where Δx(t+1) = x(t+1) − x(t) for any variable x(t), E_t[·] = E(· | I_t) is
the expectation operator conditional on I_t, the information available to
the economic agents at time t, and a positive constant b < 1 is an
adjustment coefficient. The idea behind Equation (2-17) is that
the domestic price level slowly adjusts toward its long-run PPP
level of p*(t) + e(t). The adjustment speed is slow when b is close
to zero, and the adjustment speed is fast when b is close to one.


From Equation (2-17), we obtain

$$ \Delta p(t+1) = d + b[p^*(t) + e(t) - p(t)] + \Delta p^*(t+1) + \Delta e(t+1) + \varepsilon(t+1), \tag{2-18} $$

where d = bμ and ε(t+1) = E_t[p*(t+1) + e(t+1)] − [p*(t+1) + e(t+1)].
Hence ε(t+1) is a one-period ahead forecasting error, and
E[ε(t+1) | I_t] = 0.

Equation (2-18) motivates the form of the structural ECM used
in the next section. The single equation method described in the
next section is based only on (2-18), and can be applied to any
model with a slow adjustment equation like (2-17).
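
The step from (2-17) to (2-18) is short but worth spelling out. As a sketch, write ε(t+1) for the forecast error (the symbol is chosen here for illustration) and replace the conditional expectation in (2-17) by the realized value plus that error:

```latex
\begin{aligned}
E_t[p^*(t+1) + e(t+1)] &= p^*(t+1) + e(t+1) + \varepsilon(t+1), \\
E_t[p^*(t+1) + e(t+1)] - [p^*(t) + e(t)] &= \Delta p^*(t+1) + \Delta e(t+1) + \varepsilon(t+1), \\
\Delta p(t+1) &= b\mu + b[p^*(t) + e(t) - p(t)]
                 + \Delta p^*(t+1) + \Delta e(t+1) + \varepsilon(t+1).
\end{aligned}
```

The last line is (2-18) with d = bμ. Because ε(t+1) is correlated with the realized changes on the right hand side, least squares is not appropriate for this equation, which is what motivates the instrumental variables treatment.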


We close the model by adding the money demand equation and 
the Uncovered Interest Parity condition. Let 


$$ m(t) = \theta_m + p(t) - h\,i(t) \tag{2-19} $$
$$ i(t) = i^*(t) + E[e(t+1) \mid I_t] - e(t), \tag{2-20} $$


where m(t) is the log nominal money supply minus the log real 
national income, i(t) is the nominal interest rate in the domestic 
country, and i*(t) is the nominal interest rate in the foreign country. 
Here, we are assuming that the income elasticity of money is one.
From (2-19) and (2-20), we obtain

$$ E[e(t+1) \mid I_t] - e(t) = \frac{1}{h}\Big[\theta_m + p(t) - \omega(t) - h\{E[p^*(t+1) \mid I_t] - p^*(t)\}\Big], \tag{2-21} $$

where

$$ \omega(t) = m(t) + h\,r^*(t), $$

and r*(t) is the foreign real interest rate:

$$ r^*(t) = i^*(t) - E[p^*(t+1) \mid I_t] + p^*(t). $$


Following Mussa, solving (2-20) and (2-21) as a system of stochastic
difference equations for E[p(t+j) | I_t] and E[e(t+j) | I_t] for
fixed t results in


PORAT OD: aa% 
—E|F(t — j) | L-;-1]} 


elt) = FEIF (E) | -pO ppl), 23) 
where

$$ F(t) = (1 - \delta) \sum_{j=0}^{\infty} \delta^j \omega(t+j), $$

and δ = h/(1 + h).

We assume that ω(t) is first difference stationary. Since δ is a
positive constant that is smaller than one, this implies that F(t) is
also first difference stationary. From (2-22) and (2-23),

e(t) Te) — p(t) = 


Sa-i — 6) {E[F(t — j) | Lj] — EIF G — 9) | L-j-r]}- 


Since the right hand side of this equation is stationary,⁵ e(t) +
p*(t) − p(t) is stationary. Hence (p(t), e(t), p*(t))' is cointegrated
with a cointegrating vector (1, −1, −1)'.

5 This assumes that E_t[F(t)] − E_{t−1}[F(t)] is stationary, which is true for a
large class of first difference stationary variables F(t) and information sets.


In order to obtain a structural ECM representation from the
exchange rate model, we use Hansen and Sargent’s [1980], [1982]
formula for linear rational expectations models. From (2-23), we
obtain

$$ \Delta e(t+1) = \frac{bh+1}{bh}(1-\delta)\, E\Big[\sum_{j=0}^{\infty} \delta^j \Delta\omega(t+j+1) \,\Big|\, I_t\Big] + \frac{1}{bh}\Delta p(t+1) - \Delta p^*(t+1) + \varepsilon_e(t+1), \tag{2-24} $$

where ε_e(t+1) = ((bh+1)/(bh)){E[F(t+1) | I_{t+1}] − E[F(t+1) | I_t]}, so that the
law of iterated expectations implies E[ε_e(t+1) | I_t] = 0. Because
this equation involves a discounted sum of expected future values
of Δω(t), the system method in the next section is applicable.


Hansen and Sargent [1982] propose to project the conditional
expectation of the discounted sum, E[Σ_{j=0}^∞ δ^j Δω(t+j+1) | I_t], onto
an information set H_t, which is a subset of I_t, the economic agents’
information set. Let E(· | H_t) be the linear projection operator
conditional on H_t.


We take the econometrician’s information set at t, H_t, to be
the one generated by the linear functions of the current and past
values of Δp*(t). Then replacing the best forecast of the economic
agents, E[Σ_{j=0}^∞ δ^j Δω(t+j+1) | I_t], by the econometrician’s linear
forecast based on H_t in Equation (2-24), we obtain

$$ \Delta e(t+1) = \frac{bh+1}{bh}(1-\delta)\, E\Big[\sum_{j=0}^{\infty} \delta^j \Delta\omega(t+j+1) \,\Big|\, H_t\Big] + \frac{1}{bh}\Delta p(t+1) - \Delta p^*(t+1) + u_2(t+1), $$

where

$$ u_2(t+1) = \varepsilon_e(t+1) + \frac{bh+1}{bh}(1-\delta)\Big\{ E\Big[\sum_{j=0}^{\infty} \delta^j \Delta\omega(t+j+1) \,\Big|\, I_t\Big] - E\Big[\sum_{j=0}^{\infty} \delta^j \Delta\omega(t+j+1) \,\Big|\, H_t\Big] \Big\}. $$

Because H_t is a subset of I_t, we obtain E[u_2(t+1) | H_t] = 0.


Since E[· | H_t] is the linear projection operator onto H_t, there
exist possibly infinite order lag polynomials β(L), γ(L), and ξ(L)
such that

$$ E[\Delta p^*(t+1) \mid H_t] = \beta(L)\Delta p^*(t) $$

$$ E[\Delta\omega(t+1) \mid H_t] = \gamma(L)\Delta p^*(t) $$

$$ E\Big[\sum_{j=0}^{\infty} \delta^j \Delta\omega(t+j+1) \,\Big|\, H_t\Big] = \xi(L)\Delta p^*(t). $$
Then following Hansen and Sargent [1980, Appendix A], we obtain
the restrictions imposed by (2-24) on ξ(L):

$$ \xi(L) = \frac{\gamma(L) - \delta L^{-1}\gamma(\delta)\{1 - \delta\beta(\delta)\}^{-1}\{1 - L\beta(L)\}}{1 - \delta L^{-1}}. \tag{2-25} $$

Assume that linear projections of Δp*(t+1) and Δω(t+1) onto
H_t have only a finite number of Δp*(t) terms:

$$ E[\Delta p^*(t+1) \mid H_t] = \beta_1\Delta p^*(t) + \beta_2\Delta p^*(t-1) + \dots + \beta_p\Delta p^*(t-p+1) \tag{2-26} $$

$$ E[\Delta\omega(t+1) \mid H_t] = \gamma_1\Delta p^*(t) + \gamma_2\Delta p^*(t-1) + \dots + \gamma_{p-1}\Delta p^*(t-p+2). \tag{2-27} $$
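Here we assume β(L) is of order p and γ(L) is of order p − 1 in order
to simplify the exposition, but we do not lose generality because
any of the β_i and γ_i can be zero. Then as in Hansen and Sargent [1982],
(2-25) implies that ξ(L) = ξ_0 + ξ_1 L + ... + ξ_{p−1} L^{p−1}, where

$$ \begin{aligned}
\xi_0 &= \gamma(\delta)\{1 - \delta\beta(\delta)\}^{-1} \\
\xi_j &= \delta\gamma(\delta)\{1 - \delta\beta(\delta)\}^{-1}(\beta_{j+1} + \delta\beta_{j+2} + \dots + \delta^{p-j-1}\beta_p) \\
      &\quad + (\gamma_{j+1} + \delta\gamma_{j+2} + \dots + \delta^{p-j-1}\gamma_p) \quad \text{for } j = 1, \dots, p-1,
\end{aligned} \tag{2-28} $$

with γ_p = 0. Thus

$$ E\Big[\sum_{j=0}^{\infty} \delta^j \Delta\omega(t+j+1) \,\Big|\, H_t\Big] = \xi_0\Delta p^*(t) + \xi_1\Delta p^*(t-1) + \dots + \xi_{p-1}\Delta p^*(t-p+1). \tag{2-29} $$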
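
The closed form for the ξ coefficients can be checked against a brute-force computation of the projected discounted sum. The sketch below is an illustration under assumptions: ξ_j is taken to be the coefficient on Δp*(t − j), the parameter values are hypothetical, and the function names are invented here. The brute-force version iterates the AR(p) forecasting rule through its companion matrix and accumulates δ^j times the j-step-ahead forecast weights.

```python
def xi_closed_form(beta, gamma, delta):
    """xi_0, ..., xi_{p-1} from the closed form. beta = [beta_1..beta_p],
    gamma = [gamma_1..gamma_p] (the last entry may be zero)."""
    p = len(beta)
    beta_at_delta = sum(b * delta ** i for i, b in enumerate(beta))    # beta(delta)
    gamma_at_delta = sum(g * delta ** i for i, g in enumerate(gamma))  # gamma(delta)
    kappa = gamma_at_delta / (1.0 - delta * beta_at_delta)
    xi = []
    for j in range(p):
        beta_tail = sum(delta ** (i - j) * beta[i] for i in range(j, p))
        gamma_tail = sum(delta ** (i - j) * gamma[i] for i in range(j, p))
        xi.append(delta * kappa * beta_tail + gamma_tail)
    return xi


def xi_brute_force(beta, gamma, delta, n_terms=2000):
    """Truncated sum over j of delta^j * gamma' A^j, where A is the
    companion matrix of the AR(p) forecasting rule."""
    p = len(beta)
    row = list(gamma)        # gamma' A^j, starting at j = 0
    total = [0.0] * p
    weight = 1.0
    for _ in range(n_terms):
        total = [t + weight * r for t, r in zip(total, row)]
        # Post-multiply the row vector by the companion matrix A.
        row = [row[0] * beta[k] + (row[k + 1] if k + 1 < p else 0.0)
               for k in range(p)]
        weight *= delta
    return total


beta = [0.5, -0.2, 0.1]     # hypothetical beta_1..beta_3
gamma = [0.3, 0.4, 0.0]     # hypothetical gamma_1, gamma_2, with gamma_3 = 0
delta = 0.9
xi_formula = xi_closed_form(beta, gamma, delta)
xi_direct = xi_brute_force(beta, gamma, delta)
print(xi_formula)
print(xi_direct)
```

For j = 0 the closed form collapses to γ(δ){1 − δβ(δ)}^{-1}, so the first coefficient needs no separate treatment in the code.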
Using (2-18), (2-24), (2-26) and (2-27) we obtain a system of four
equations

$$ \begin{aligned}
\Delta p(t+1) &= d + \Delta p^*(t+1) + \Delta e(t+1) - b[p(t) - p^*(t) - e(t)] + u_1(t+1) \\
\Delta e(t+1) &= \frac{1}{bh}\Delta p(t+1) - \Delta p^*(t+1) + \alpha\xi_0\Delta p^*(t) + \alpha\xi_1\Delta p^*(t-1) + \dots + \alpha\xi_{p-1}\Delta p^*(t-p+1) + u_2(t+1) \\
\Delta p^*(t+1) &= \beta_1\Delta p^*(t) + \beta_2\Delta p^*(t-1) + \dots + \beta_p\Delta p^*(t-p+1) + u_3(t+1) \\
\Delta\omega(t+1) &= \gamma_1\Delta p^*(t) + \gamma_2\Delta p^*(t-1) + \dots + \gamma_{p-1}\Delta p^*(t-p+2) + u_4(t+1),
\end{aligned} \tag{2-30} $$

where α = ((bh+1)/(bh))(1 − δ) and u_1(t+1) = ε(t+1). Given the data for
[Δp(t+1), Δe(t+1), Δp*(t+1), Δω(t+1)]', GMM can be applied
to these four equations as will be discussed in the next section.


The exchange rate model can be written in the SECM form
(2-16) as in the system of equations (2-30): we have Δy(t+1) =
[Δp(t+1), Δe(t+1), Δp*(t+1), Δω(t+1)]', B = [−b, 0, 0, 0]',
A = [1, −1, −1, 0]',

$$ C_0 = \begin{pmatrix} 1 & -1 & -1 & 0 \\ -1/(bh) & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \tag{2-31} $$

and

$$ C_j = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & \alpha\xi_{j-1} & 0 \\ 0 & 0 & \beta_j & 0 \\ 0 & 0 & \gamma_j & 0 \end{pmatrix} $$

for j = 1, ..., p, where γ_p = 0. Note that for any nonzero constant
ψ, ψ(1, −1, −1)' is also a cointegrating vector. However, the first
element of B is −b only when ψ is normalized to one.


It is instructive to observe the relationship between the structural
ECM and the reduced form ECM in the exchange rate model.
Because

$$ C_0^{-1} = \begin{pmatrix} bh/(bh-1) & bh/(bh-1) & 0 & 0 \\ 1/(bh-1) & bh/(bh-1) & -1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, $$

G = C_0^{-1}B = [−b²h/(bh − 1), −b/(bh − 1), 0, 0]'. Comparing G
and B, observe the way in which the contemporaneous interactions
between the domestic price and the exchange rate affect the speed of
adjustment coefficients. The speed of adjustment coefficient for the
domestic price is b in the structural model, while it is b²h/(bh − 1) in
the reduced form model. The error correction term does not appear
in the second equation for the exchange rate in the structural ECM,
while it appears with the speed of adjustment coefficient of b/(bh −
1) in the reduced form model.
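
The inverse and the implied reduced form coefficients are easy to verify with exact arithmetic. The sketch below is illustrative only: the values of b and h are hypothetical, C_0 is entered with −1/(bh) in its second row as in the SECM form of the model, and `invert` is a small helper written for this example.

```python
from fractions import Fraction


def invert(matrix):
    """Gauss-Jordan inverse of a small square matrix of Fractions."""
    n = len(matrix)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


b, h = Fraction(1, 4), Fraction(2)   # hypothetical values with b*h != 1
C0 = [[1, -1, -1, 0],
      [-1 / (b * h), 1, 1, 0],
      [0, 0, 1, 0],
      [0, 0, 0, 1]]
C0 = [[Fraction(v) for v in row] for row in C0]
B = [-b, 0, 0, 0]

C0_inv = invert(C0)
G = [sum(C0_inv[i][k] * B[k] for k in range(4)) for i in range(4)]
print(G)
```

With these values the structural speed of adjustment is b = 1/4, while the reduced form coefficients in G differ from it in both equations, which is the point of the comparison in the text.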


In the exchange rate model, b is a structural parameter of
interest. For the purpose of estimating b in the model, the
restriction that C_0 is lower triangular is not attractive. However, as is
clear from Equation (2-30), the structural ECM from the one-good
version of the exchange rate model does not satisfy the restriction
that C_0 is lower triangular for any ordering of the variables. Even
though some structural models may be written in lower triangular
form, this example suggests that many structural models cannot be
written in that particular form.


2.4.4 The Instrumental Variables Methods 


Because standard methods of estimating Equation (2-15) may not
recover the structural parameters of interest in B, Ogaki and Yang
[1997] propose two instrumental variables methods, which do not
require restrictions on C_0.


First, we consider a single equation method, which applies an 
IV method to a slow adjustment equation. Imagine that we are 
interested in estimating the first row of Equation (2-16). In some 
applications, the cointegrating vectors are known, and thus the 
values of A are known. In other applications, the values of A 
are unknown. In the case of the unknown cointegrating vectors, a 
two step method that is similar to Engle and Granger’s [1987] and 
Cooley and Ogaki’s [1996] methods can be used. In this two-step 
method, the cointegrating vectors are estimated in the first step. 


In the first step, we estimate A, using a method to consistently 
estimate cointegrating vectors. There exist many methods to esti- 
mate cointegrating vectors. Johansen’s Maximum Likelihood (ML) 
Estimators for Equation (2-15) can be used for this purpose. If p 
is equal to one, estimators based on regressions that are as efficient 
as Johansen’s ML estimators such as Phillips and Hansen’s [1990] 
Fully Modified Estimation Method, Park’s [1992] Canonical Coin- 
tegrating Regression, and Stock and Watson’s [1993] estimators 
can be used. Ordinary Least Squares estimators are also consistent 
when p is equal to one, but not as efficient as these estimators. We
assume that A_T is the first step estimator, where T is the sample
size, and that A_T converges to A at a faster rate than T^{1/2}.⁶


6 Usually, A_T converges at the rate of T, but there are cases where A_T
converges at the rate of T^{3/2} (see West [1988]).
In the single equation method, an IV method is applied to

$$ \Delta y_1(t+1) = d_1 - c^0_{12}\Delta y_2(t+1) - \dots - c^0_{1n}\Delta y_n(t+1) + b_1 A'y(t) + c^1_1\Delta y(t) + c^2_1\Delta y(t-1) + \dots + c^p_1\Delta y(t-p+1) + u_1(t+1), \tag{2-32} $$

where y_i(t) is the i-th element of y(t), d_1 is the first element of d,
c^0_{1i} is the i-th element of the first row of C_0, b_1 is the first row
of B, c^j_1 is the first row of C_j, and u_1(t) is the first element of u(t).
When E[u_1(t+1) | H_{t−τ}] = 0, any stationary variable in the information
set available at time t − τ that is correlated with variables on the
right hand side of Equation (2-32) can be used as an instrumental
variable.
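
A small simulation shows why instruments are needed for a slow adjustment equation. This is a sketch under strong simplifying assumptions, not the authors' procedure: the foreign price and the exchange rate are collapsed into a single hypothetical variable s(t) = p*(t) + e(t) whose change follows an AR(1), μ is set to zero, the cointegrating vector is treated as known, and all parameter values and helper names are invented for the example. Agents forecast Δs(t+1) by ρΔs(t), so the regressor Δs(t+1) is correlated with the forecast error, and lagged variables dated t serve as instruments.

```python
import random


def solve(a, y):
    """Solve the square linear system a @ x = y by Gaussian elimination."""
    n = len(a)
    m = [list(row) + [rhs] for row, rhs in zip(a, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def iv_estimate(y, X, Z):
    """Just-identified IV: solve (Z'X) theta = Z'y. OLS is the case Z = X."""
    k = len(X[0])
    zx = [[sum(z[i] * x[j] for z, x in zip(Z, X)) for j in range(k)] for i in range(k)]
    zy = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(k)]
    return solve(zx, zy)


rng = random.Random(1)
b_true, rho, T = 0.2, 0.5, 20000
s_prev, s, p = 0.0, 0.0, 0.0          # s(t) = p*(t) + e(t); p(t) = domestic price
y, X, Z = [], [], []
for _ in range(T):
    ds = s - s_prev                    # change in s known at time t
    ect = s - p                        # error correction term
    dp_next = b_true * ect + rho * ds  # slow adjustment plus expected change in s
    ds_next = rho * ds + rng.gauss(0.0, 1.0)
    y.append(dp_next)
    X.append([1.0, ect, ds_next])      # regressors of the structural equation
    Z.append([1.0, ect, ds])           # instruments dated t
    s_prev, s, p = s, s + ds_next, p + dp_next

theta_iv = iv_estimate(y, X, Z)        # (constant, b, coefficient on the change in s)
theta_ols = iv_estimate(y, X, X)
print(theta_iv)
print(theta_ols)
```

In this design the IV estimate of b should be near the assumed 0.2 and the coefficient on the realized change near one, while least squares is pulled away from both because the regressor contains the forecast error.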


The system method combines the single equation method with 
Hansen and Sargent’s [1982] procedure to impose nonlinear restric- 
tions implied by rational expectations models. 


Let y(t) = (y_1(t), y_2(t), y_3(t), y_4(t))' be a 4 × 1 vector of
random variables with a structural ECM representation (2-16)
and only one linearly independent cointegrating vector A such that
A'y(t) is stationary. In the following, y(t) is partitioned into four
subvectors, and each subvector is given a different role. For
expositional simplicity, we assume that each subvector is one dimensional
so that y(t) is a 4 × 1 vector, and that y(t) has only one cointegrating
vector.


The first element of y(t) represents a slow adjustment as in
Equation (2-18), with nonzero b, where E[u_1(t+1) | H_{t−τ}] = 0.
We assume that the second element of y(t) is related to a discounted
sum of expected future values of the fourth element in the following
form:

$$ \Delta y_2(t+1) = d_2 - c^0_{21}\Delta y_1(t+1) - c^0_{23}\Delta y_3(t+1) - c^0_{24}\Delta y_4(t+1) + \alpha E\Big[\sum_{j=0}^{\infty} \delta^j \Delta y_4(t+j+1) \,\Big|\, I_t\Big] + \varepsilon_e(t+1), \tag{2-33} $$

where δ is a positive constant that is smaller than one, and α is
a constant. As pointed out by Hansen and Sargent, many linear
rational expectations models imply that one variable is a geometrically
declining weighted sum of expected future values of other
variables.


Hansen and Sargent’s [1982] methodology is to project the conditional
expectation of the discounted sum, E[Σ_{j=0}^∞ δ^j Δy_4(t+j+1) | I_t],
onto an information set H_t, which is a subset of I_t, the economic
agents’ information set. Let E(· | H_t) be the linear projection
operator conditional on H_t. Replacing the conditional expectation
by the linear projection gives

$$ \Delta y_2(t+1) = d_2 - c^0_{21}\Delta y_1(t+1) - c^0_{23}\Delta y_3(t+1) - c^0_{24}\Delta y_4(t+1) + \alpha E\Big[\sum_{j=0}^{\infty} \delta^j \Delta y_4(t+j+1) \,\Big|\, H_t\Big] + u_2(t+1), \tag{2-34} $$

where

$$ u_2(t+1) = \varepsilon_e(t+1) + \alpha\Big\{ E\Big[\sum_{j=0}^{\infty} \delta^j \Delta y_4(t+j+1) \,\Big|\, I_t\Big] - E\Big[\sum_{j=0}^{\infty} \delta^j \Delta y_4(t+j+1) \,\Big|\, H_t\Big] \Big\}. $$

Because H_t is a subset of I_t, we obtain E[u_2(t+1) | H_t] = 0.


The current and past values of the first difference of the third
element of y(t) are used to form the econometrician’s information
set H_t. Since E[· | H_t] is the linear projection operator onto H_t,
there exist possibly infinite order lag polynomials β(L), γ(L), and
ξ(L) such that

$$ E[\Delta y_3(t+1) \mid H_t] = \beta(L)\Delta y_3(t) \tag{2-35} $$

$$ E[\Delta y_4(t+1) \mid H_t] = \gamma(L)\Delta y_3(t) \tag{2-36} $$

$$ E\Big[\sum_{j=0}^{\infty} \delta^j \Delta y_4(t+j+1) \,\Big|\, H_t\Big] = \xi(L)\Delta y_3(t). \tag{2-37} $$

Then following Hansen and Sargent [1980, Appendix A], we obtain
the restrictions imposed on ξ(L):

$$ \xi(L) = \frac{\gamma(L) - \delta L^{-1}\gamma(\delta)\{1 - \delta\beta(\delta)\}^{-1}\{1 - L\beta(L)\}}{1 - \delta L^{-1}}. \tag{2-38} $$

Substituting (2-37) into (2-34) gives the equation

$$ \Delta y_2(t+1) = d_2 - c^0_{21}\Delta y_1(t+1) - c^0_{23}\Delta y_3(t+1) - c^0_{24}\Delta y_4(t+1) + \alpha\xi(L)\Delta y_3(t) + u_2(t+1), \tag{2-39} $$
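where ξ(L) is given by (2-38). We now make an additional
assumption that the lag polynomials β(L) and γ(L) are finite order
polynomials, so that

$$ \Delta y_3(t+1) = \beta_1\Delta y_3(t) + \beta_2\Delta y_3(t-1) + \dots + \beta_p\Delta y_3(t-p+1) + u_3(t+1) \tag{2-40} $$

$$ \Delta y_4(t+1) = \gamma_1\Delta y_3(t) + \gamma_2\Delta y_3(t-1) + \dots + \gamma_{p-1}\Delta y_3(t-p+2) + u_4(t+1), \tag{2-41} $$

where Proj[u_i(t+1) | H_t] = 0 for i = 3, 4. Here we assume β(L)
is of order p and γ(L) is of order p − 1 in order to simplify the
exposition, but we do not lose generality because any of the β_i and γ_i
can be zero. Then as in Hansen and Sargent [1982], (2-38) implies
that ξ(L) = ξ_0 + ξ_1 L + ... + ξ_{p−1} L^{p−1}, where

$$ \begin{aligned}
\xi_0 &= \gamma(\delta)\{1 - \delta\beta(\delta)\}^{-1} \\
\xi_j &= \delta\gamma(\delta)\{1 - \delta\beta(\delta)\}^{-1}(\beta_{j+1} + \delta\beta_{j+2} + \dots + \delta^{p-j-1}\beta_p) \\
      &\quad + (\gamma_{j+1} + \delta\gamma_{j+2} + \dots + \delta^{p-j-1}\gamma_p) \quad \text{for } j = 1, \dots, p-1.
\end{aligned} \tag{2-42} $$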
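In the SECM form, we have B = [−b, 0, 0, 0]', A = [1, −1, −1, 0]',

$$ C_0 = \begin{pmatrix} 1 & c^0_{12} & c^0_{13} & c^0_{14} \\ c^0_{21} & 1 & c^0_{23} & c^0_{24} \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, $$

and

$$ C_j = \begin{pmatrix} c^j_{11} & c^j_{12} & c^j_{13} & c^j_{14} \\ 0 & 0 & \alpha\xi_{j-1} & 0 \\ 0 & 0 & \beta_j & 0 \\ 0 & 0 & \gamma_j & 0 \end{pmatrix} $$

for j = 1, ..., p, where γ_p = 0.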


We have now obtained a system of four equations that consists
of (2-32), (2-39), (2-40), and (2-41). Because E(u_1(t+1) | H_{t−τ}) = 0
and E(u_i(t+1) | H_t) = 0, we can obtain a vector of instrumental
variables z_1(t) in H_{t−τ} for u_1(t+1) and z_i(t) in H_t for u_i(t+1)
(i = 2, 3, 4).


Because the speed of adjustment b for y_1(t) affects dynamics of
other variables,⁷ there will be cross-equation restrictions involving
b in many applications in addition to the restrictions in (2-42).
Using the moment conditions E[z_i(t)u_i(t+1)] = 0 for i = 1, ..., 4, we
form a GMM estimator, imposing the restrictions (2-42) and the
other cross-equation restrictions implied by the model.


7 Note that only y_1(t) adjusts slowly, but b affects dynamics of other variables
because of interactions of y_1(t) and those variables.
Given estimates of cointegrating vectors from the first step, this
system method provides more efficient estimators than the single
equation method as long as the restrictions implied by the model
are true. On the other hand, the single equation two step method
estimators are more robust because misspecification in other
equations does not affect their consistency.


2.5 Some Aspects of GMM Estimation 


2.5.1 Numerical Optimization 


For nonlinear models, it is usually necessary to apply a numerical
optimization method to compute a GMM estimator by numerically
minimizing the criterion function, Q_T(θ). The Newton-Raphson
method (see, e.g., Hamilton [1994, Chapter 5]) is often used with an
approximation method to calculate the Hessian matrix. A problem
with the Newton-Raphson method and other practical numerical
optimization methods is that global optimization is not guaranteed.
The GMM estimator is defined as a global minimizer of a GMM
criterion function, and the proof of its asymptotic properties depends
on this assumption. Therefore, the use of a local optimization
method can result in an estimator that is not necessarily consistent
and asymptotically normal.


If the parameter space is convex and the criterion function is
strictly convex, then the criterion function has a unique local
minimum, which is also the global minimum. In this case, a local
optimization algorithm started at any parameter values should be able
to reach an approximate global minimum.


For nonconvex problems, however, there can be many local
minima. For such problems, an algorithm called multi-start is
often used for GMM applications. In this algorithm, one starts
a local optimization algorithm from initial values of the parameters
to converge to a local minimum, and then one repeats the process a
number of times with different initial values. The estimator is taken
to be the parameter values that correspond to the smallest value of
the criterion function obtained during the multi-start process.
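
The multi-start idea can be sketched in a few lines. Everything in the example below is a hypothetical stand-in: the one-parameter criterion is an artificial nonconvex function rather than a GMM objective, and the crude derivative-free local search takes the place of Newton-Raphson or any other local optimizer.

```python
import random


def local_minimize(f, x0, step=0.5, tol=1e-10):
    """Crude derivative-free local search with a shrinking step; a
    stand-in for Newton-Raphson or any other local optimizer."""
    x, fx = x0, f(x0)
    while step > tol:
        for cand in (x - step, x + step):
            fc = f(cand)
            if fc < fx:
                x, fx = cand, fc
                break
        else:
            step /= 2.0   # no improvement at this step size
    return x, fx


def multi_start(f, n_starts, lower, upper, seed=0):
    """Run the local search from random starting values and keep the
    parameter value with the smallest criterion value."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        x, fx = local_minimize(f, rng.uniform(lower, upper))
        if best is None or fx < best[1]:
            best = (x, fx)
    return best


def criterion(theta):
    """Hypothetical nonconvex criterion: a local minimum near -0.81 and
    the global minimum (value 0) at theta = 2."""
    return (theta - 2.0) ** 2 * ((theta + 1.0) ** 2 + 0.5)


x_local, f_local = local_minimize(criterion, -3.0)    # a single unlucky start
x_best, f_best = multi_start(criterion, 20, -4.0, 4.0)
print(x_local, f_local)
print(x_best, f_best)
```

A single start from the left gets trapped at the inferior local minimum, while the best of several starts reaches the global one in this example. Nothing in the procedure guarantees that outcome in general.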
It should be noted that this multi-start algorithm is used for
a given distance matrix. When the two stage or iterative GMM
estimators are used, a different distance matrix is used in each
stage, and hence a different criterion function is minimized. In
most GMM programs, one needs to save the distance matrix in a
file in order to apply the multi-start algorithm in each stage.


A problem with the multi-start algorithm, however, is that 
it does not necessarily find the global optimum. Therefore, the 
estimator it delivers is not necessarily consistent and asymptotically 
normal. Andrews [1997] proposes a simple stopping-rule procedure 
that overcomes this difficulty. 


2.5.2 The Choice of Instrumental Variables 


In the NLIV model discussed in Section 2.2, there are infinitely
many possible instrumental variables because any variable in I_t
can be used as an instrument.


Nelson and Startz [1990] perform Monte Carlo simulations to
investigate small sample properties of linear instrumental variables
regressions. They show that instrumental variables estimators have
poor small sample properties when the instruments are weakly
correlated with explanatory variables. In particular, they find that
the Chi-square test tends to reject the null too frequently compared
with its asymptotic distribution, and that t-ratios tend to be too
large when the instrument is poor. Their results for t-ratios may seem
counterintuitive because one might expect that the consequence of
having a poor instrument would be a large standard error and a low
t-ratio. Their results may be expected to carry over to NLIV
estimation. Pagan and Jung [1993] provide practical recommendations
for checking whether or not instrumental variables are valid in the
context of GMM and other instrumental variables estimation.


Hansen [1985] characterizes an efficiency bound (that is, a
greatest lower bound) for the asymptotic covariance matrices of the
alternative GMM estimators and optimal instruments that attain
the bound. Since it can be time consuming to obtain optimal
instruments, an econometrician may wish to compute an estimate
of the efficiency bound to assess efficiency losses from using ad
hoc instruments. Hansen [1985] also provides a method for
calculating this bound for models with conditionally homoskedastic
disturbance terms with an invertible MA representation.⁸ Hansen,
Heaton, and Ogaki [1988] extend this method to models with
conditionally heteroskedastic disturbances and models with an MA
representation that is not invertible.⁹ Hansen and Singleton [1996]
calculate these bounds and optimal instruments for a continuous
time financial economic model. Chamberlain [1987] and Hansen
[1993] show that the GMM efficiency bound coincides with the
semi-parametric efficiency bound for finite parameter maximum
likelihood estimators.


Tauchen [1986] and Hansen and Singleton [1996] find that the 
GMM estimators with Hansen’s [1985] optimal instruments often 
show poor small sample properties when they are applied to the 
C-CAPM. West and Wilcox [1996], however, find that the GMM 
estimators with optimal instruments are substantially more effi- 
cient in small samples for an inventory model with plausible data 
generation processes. 


2.5.3 Normalization 


In the habit formation model, the GMM disturbance term is nor- 
malized to avoid the trivial solutions of the sample counterpart 
of the moment conditions as explained in Section 3.2. In some 
applications, the GMM disturbance is normalized to incorporate 
a priori information about parameters. An extreme example of a 
priori information is that some parameter values are not admissible. 
Another example is that certain parameter values are not very plau- 
sible. This section discusses some issues regarding normalizations. 


Consider a GMM estimator based on moment restrictions of
the form (2-1). Let φ(θ) be a real valued function. Then one can
define a new function f*(X_t, θ) = f(X_t, θ)φ(θ), and we still have
moment restrictions

$$ E(f^*(X_t, \theta_0)) = 0. $$

One can apply GMM to these moment restrictions instead of
applying it to (2-1).


8 Hayashi and Sims’ [1983] estimator is applicable to this example.
9 Heaton and Ogaki [1991] provide an algorithm to calculate efficiency bounds
for a continuous time financial economic model based on Hansen, Heaton,
and Ogaki’s [1988] method.


In NLIV estimation in Section 2.2, even random variables can
be used to normalize the GMM disturbance function. If y_t is
a vector of random variables in I_t and if φ(y_t, θ) is a (measurable)
real valued function, then one can define a new function
h*(x_t, θ) = h(x_t, θ)φ(y_t, θ). Because φ(y_t, θ) is in I_t, h*(x_t, θ) still
satisfies the conditional moment conditions E[h*(x_t, θ_0) | I_t] = 0. Thus
the GMM estimation can be applied to h*(x_t, θ).


These ideas are now illustrated for the habit formation model in
Section 3.2. In the model, the Euler equation implies E(ε_t | I_t) = 0,
where ε_t = {β(S_{t+1}^{−α} + βa_1 S_{t+2}^{−α})R_{t+1} − (S_t^{−α} + βa_1 S_{t+1}^{−α})}/{S_t^{−α}[1 +
βa_1]}.¹⁰

One problem that researchers have encountered in these
applications is that C_t + a_1 C_{t−1} may be negative when a_1 is close to
minus one. The values of a_1 that make C_t + a_1 C_{t−1} negative are
not admissible. In order to incorporate this a priori information,
one can program the procedure that defines h(x_t, θ) so that very
large numbers are returned as the values of h(x_t, θ) when a_1 falls
in the nonadmissible region.¹¹


When $a_1$ is positive and greater than one, we can obtain a
well behaved utility function. However, one may argue that it is not
plausible for $a_1$ to be greater than one: why is the previous period’s
consumption more important than this period’s consumption for
this period’s utility level? One way to incorporate this type of
a priori information is to use a Bayesian method. Another way
is to penalize such implausible parameter values in defining the
GMM disturbance. Let $f(X_t, \theta)$ be the original GMM disturbance
function, where $\theta = (\delta, a_1, \alpha)$. Then define


$$\phi(\theta) = \begin{cases} 1 & \text{if } a_1 \le 1 \\ \cdots & \text{if } a_1 > 1 \end{cases} \qquad (2\text{-}43)$$


and $f^*(X_t, \theta) = f(X_t, \theta)\phi(\theta)$. Now $f^*(X_t, \theta)$ can be used as the
new GMM disturbance. Here the function $\phi$ is designed so that it is


10 In the Hansen-Heaton-Ogaki GMM package, the GMMHF.EXP file (Appendix
C), which is a minor modification of a program used by Cooley and Ogaki
[1996], is included to give an example of programming.

11 It is necessary to modify the numerical derivative procedure in order
to prevent these fictitious large numbers from being used to calculate
numerical derivatives. The Hansen-Heaton-Ogaki package provides an example in
GRADQG.PRC, which is used by GMMHF.EXP.
differentiable at $a_1 = 1$ and it does not affect the function $f$ when
$a_1 \le 1$. The latter feature is important because of small sample
considerations.
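The penalty idea can be sketched in a few lines of Python. The quadratic form of the penalty below is an assumption made purely for illustration (it is not the function used in the text); it is chosen only because it equals one for $a_1 \le 1$ and is differentiable at $a_1 = 1$, the two properties the text asks for. The names `phi` and `normalized_disturbance` are hypothetical.

```python
import numpy as np

def phi(a1):
    """Hypothetical penalty: exactly 1 for a1 <= 1, smoothly increasing for a1 > 1."""
    excess = np.maximum(a1 - 1.0, 0.0)
    return 1.0 + excess ** 2          # derivative is 0 on both sides of a1 = 1

def normalized_disturbance(f_values, a1):
    """Return f*(X_t, theta) = f(X_t, theta) * phi(theta) for given f values."""
    return np.asarray(f_values, dtype=float) * phi(a1)

f_vals = np.array([0.2, -0.1, 0.05])                 # illustrative disturbance values
unchanged = normalized_disturbance(f_vals, a1=0.5)   # plausible region: f is untouched
penalized = normalized_disturbance(f_vals, a1=2.0)   # implausible region: f is inflated
```

Because the disturbance is left untouched wherever no prior information applies, the sample moment conditions are only distorted in the implausible region.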


One might argue that the curvature parameter $\alpha$, which will
be close to the relative risk aversion parameter under some circumstances,
as explained by Ferson and Constantinides, is not likely
to be greater than ten (see Mehra and Prescott [1985]). This
restriction can be incorporated by a normalization similar to (2-43).


Even though any differentiable function can be used as the 
normalization function ¢ without disturbing the consistency of the 
GMM estimator, the small sample properties of GMM estimator 
will be affected by normalizations. It is thus desirable not to 
disturb the GMM function for the parameter region where we do 
not have any a priori information. This idea is incorporated in 
the function given in (2-43). When different normalizations give
very different results, it is recommended that Hansen, Heaton, and
Yaron’s [1996] continuous updating estimator, which is not affected
by normalizations, be used.
Chapter 3


COVARIANCE MATRIX ESTIMATION 


Matthew J. Cushing and Mary G. McGarvey 


This chapter focuses on estimating the covariance matrix resulting 
from GMM estimation. A consistent estimate of the covariance 
matrix is necessary for both statistical inference and efficient pa- 
rameter estimation. In the spirit of GMM estimation we focus 
attention on robust estimation techniques. Because the resulting 
covariance matrix estimates are consistent for general classes of pos- 
sibly heterogeneously distributed, autocorrelated error processes, 
these heteroskedastic, autocorrelation consistent (HAC) covariance 
estimators have a wide range of applications beyond GMM. 


In this chapter we consider both asymptotic results and small 
sample properties of covariance matrix estimators arising from
GMM estimation. 


3.1 Preliminary Results 


Our ultimate goal is to perform statistical inference on some $p \times 1$
parameter vector $\theta_0$. Recall from Chapter 1, Definition 1.2, that
the GMM estimator of $\theta_0$ is defined as


$$\hat\theta_T = \arg\min_{\theta} Q_T(\theta),$$


where the criterion function is $Q_T(\theta) = f_T(\theta)' A_T f_T(\theta)$. Sufficient
conditions for the estimator’s consistency and asymptotic normality
are discussed in Chapter 1. Here we are interested in developing
estimators of the asymptotic covariance matrix,


$$(F_T' A_T F_T)^{-1} F_T' A_T V_T A_T F_T (F_T' A_T F_T)^{-1}.$$
Notice that the asymptotic covariance matrix (as defined by
White [1984]) of the GMM estimator depends on the sample size.
This is because we do not require that $f(x_t, \theta_0)$ be a stationary
process or that the matrices $A_T$ and $F_T$ converge to a
constant matrix independent of $T$. Stationarity is not necessary
for the consistency or asymptotic normality of the GMM estimator
$\hat\theta_T$, nor is it necessary for the consistency of many of the covariance
matrix estimators discussed in this chapter.


Constructing $F_T$ typically presents few problems. The sample
analogue to $F_T(\theta_0)$, $F_T(\hat\theta_T)$, will ordinarily suffice. Similarly, estimates
of $A_T$ satisfying Assumption 1.3 are readily constructed. In the case
where the parameters are exactly identified ($p = q$), the weighting
matrix $A_T$ is inconsequential both in the construction of the GMM
estimator and its asymptotic variance. In the case where there are
more moment conditions than parameters ($q > p$), the asymptotic
variance of $\hat\theta_T$ can be minimized by choosing $A_T$ to be a consistent
estimator of $V_T^{-1}$ (see Lemma 1.1).


The critical component of the expression for the asymptotic
variance is $V_T$. We require a consistent estimator of $V_T$, i.e., we
require a $\hat V_T$ such that $\hat V_T - V_T \xrightarrow{p} 0$. The rest of this chapter is
devoted to this task.


3.2 The Estimand 


We are interested in consistent and possibly efficient estimators of


$$V_T = T \operatorname{var}(f_T(\theta_0)) = \frac{1}{T}\sum_{t=1}^{T}\sum_{s=1}^{T} E\big(f(x_t, \theta_0) f(x_s, \theta_0)'\big),$$


the average of the autocovariances of the process $f(x_t, \theta_0)$. It is
useful to rewrite $V_T$ in terms of a general autocovariance function
of $f(x_t, \theta_0)$,


$$V_T = \sum_{j=-(T-1)}^{T-1} \Omega_T(j), \quad \text{where}$$


$$\Omega_T(j) = \begin{cases} \dfrac{1}{T}\sum_{t=j+1}^{T} E(f_t f_{t-j}'), & j \ge 0 \\[2ex] \dfrac{1}{T}\sum_{t=-j+1}^{T} E(f_{t+j} f_t'), & j < 0 \end{cases}$$


and $f_t = f(x_t, \theta_0)$. The stochastic structure of the $f_t$ process
determines the optimal choice of estimator. If the structure of
the $f_t$ process is known, the estimation problem simplifies dramatically.
Before describing the general case, we present some familiar
examples.


EXAMPLE 3.1


Consider a scalar linear regression model, $y_t = x_t'\theta_0 + u_t$,
where $E(f_t) = E(x_t u_t) = 0$. Suppose that $x_t$
and $u_t$ are covariance stationary processes,
$E(u_t \mid u_{t-1}, x_t, u_{t-2}, x_{t-1}, \ldots) = 0$ and


$$E(u_t^2 \mid u_{t-1}, x_t, u_{t-2}, x_{t-1}, \ldots) = \sigma^2$$


so that $u_t$ is serially uncorrelated and conditionally
homoskedastic. Then $V_T$ reduces to


$$V_T = \Omega_T(0) = T^{-1}\sum_{t=1}^{T} E(x_t u_t u_t x_t') = \sigma^2 E(x_t x_t').$$


This is the standard OLS covariance matrix. The Method
of Moments (MM) estimator of $V_T$, constructed from sample
analogues to population moments, is


$$\hat V_T = \hat\sigma^2 T^{-1}\sum_{t=1}^{T} x_t x_t',$$


where $\hat\sigma^2 = T^{-1}\sum_{t=1}^{T} \hat u_t^2$ and $\hat u_t = y_t - x_t'\hat\theta_T$. This is, of
course, the standard maximum likelihood estimate of the
covariance matrix.


EXAMPLE 3.2


Consider again the standard linear model. Assume now
that the true error is serially uncorrelated,


$$E(u_t \mid u_{t-1}, x_t, u_{t-2}, x_{t-1}, \ldots) = 0$$


but that


$$E(u_t^2 \mid u_{t-1}, x_t, u_{t-2}, x_{t-1}, \ldots) = \sigma_t^2$$


so that the regression error $u_t$ may be conditionally
heteroskedastic. Then $V_T$ reduces to


$$V_T = \Omega_T(0) = T^{-1}\sum_{t=1}^{T} E(u_t^2 x_t x_t').$$


This matrix can be consistently estimated by sample analogues,


$$\hat V_T = T^{-1}\sum_{t=1}^{T} \hat u_t^2 x_t x_t',$$


which is the heteroskedasticity consistent estimator
proposed by White [1980] and provided as a standard option
in most econometric software packages.
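As a rough numerical illustration of Examples 3.1 and 3.2, the two sample-analogue estimators can be computed side by side. This is only a sketch on simulated data; the data-generating process, the seed and all variable names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
x = np.column_stack([np.ones(T), rng.normal(size=T)])     # rows are x_t'
u = rng.normal(size=T) * (1.0 + 0.5 * np.abs(x[:, 1]))    # heteroskedastic errors
y = x @ np.array([1.0, 2.0]) + u

theta_hat = np.linalg.lstsq(x, y, rcond=None)[0]          # OLS estimate of theta_0
u_hat = y - x @ theta_hat                                 # residuals u_hat_t

# Example 3.1: sigma_hat^2 * T^{-1} sum_t x_t x_t' (homoskedastic case)
sigma2_hat = np.mean(u_hat ** 2)
V_homo = sigma2_hat * (x.T @ x) / T

# Example 3.2: T^{-1} sum_t u_hat_t^2 x_t x_t' (heteroskedasticity consistent)
V_white = (x * u_hat[:, None] ** 2).T @ x / T
```

When the errors really are heteroskedastic, as in this simulated design, the two matrices generally differ, and only the second is consistent for $V_T$.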


EXAMPLE 3.3


Finally, if $f_t = x_t u_t$ is a linearly regular, covariance
stationary process and $\Omega_T(j) = 0$ for $j > m$, then $V_T$ is


$$V_T = \sum_{j=-m}^{m} \Omega_T(j).$$


Again, we may use sample analogues to estimate the $\Omega$'s,


$$\hat\Omega_T(j) = \begin{cases} T^{-1}\sum_{t=j+1}^{T} x_t \hat u_t \hat u_{t-j} x_{t-j}', & j \ge 0 \\[1ex] T^{-1}\sum_{t=-j+1}^{T} x_{t+j} \hat u_{t+j} \hat u_t x_t', & j < 0 \end{cases}$$


The sample analogue to $V_T$,


$$\hat V_T = \sum_{j=-m}^{m} \hat\Omega_T(j),$$


provides a consistent (but not necessarily desirable)
estimator.


In the spirit of GMM estimation, most practitioners wish to avoid 
the strong distributional assumptions employed in the above ex- 
amples. It is tempting to follow the MM approach and construct 
an estimator for $V_T$ by simply replacing all of the population
autocovariances with their sample analogues. The resulting estimator,
$\hat V_{MM}$, would be,


$$\hat V_{MM} = \sum_{j=-(T-1)}^{T-1} \hat\Omega_T(j), \quad \text{where}$$


$$\hat\Omega_T(j) = \begin{cases} T^{-1}\sum_{t=j+1}^{T} \hat f_t \hat f_{t-j}', & j \ge 0 \\[1ex] T^{-1}\sum_{t=-j+1}^{T} \hat f_{t+j} \hat f_t', & j < 0 \end{cases}$$


and $\hat f_t = f(x_t, \hat\theta_T)$.


Unfortunately, this MM estimator, $\hat V_{MM}$, is wholly unsatisfactory.
The problem, asymptotically, is that the number of estimated
autocovariances grows at the same rate as the sample size. As we
show in the next section, $\hat V_{MM}$ is, under suitable regularity
conditions, asymptotically unbiased, but, because the variance does not
approach zero, it is not consistent in the mean squared error sense.
The finite sample properties of $\hat V_{MM}$ are even more problematic.
To see this, rewrite $\hat V_{MM}$ as,


$$\hat V_{MM} = T^{-1}\sum_{t=1}^{T}\sum_{s=1}^{T} \hat f_t \hat f_s'.$$


In the exactly identified case, the sample moment conditions are
exactly satisfied. But this implies that $\hat V_{MM}$ is identically zero
for all $T$! The next section presents a class of estimators that
circumvents some of these problems.
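The degeneracy is easy to confirm numerically. The sketch below uses OLS (an exactly identified case) on simulated data; the data, the seed and the helper name `sample_autocov` are assumptions for the illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50
x = np.column_stack([np.ones(T), rng.normal(size=T)])
y = x @ np.array([0.5, -1.0]) + rng.normal(size=T)
theta_hat = np.linalg.lstsq(x, y, rcond=None)[0]
f_hat = x * (y - x @ theta_hat)[:, None]       # rows are f_hat_t = x_t * u_hat_t

def sample_autocov(f, j):
    """T^{-1} sum_t f_t f_{t-j}' for j >= 0; the transpose of lag -j for j < 0."""
    n = f.shape[0]
    if j >= 0:
        return f[j:].T @ f[:n - j] / n
    return sample_autocov(f, -j).T

# Sum every sample autocovariance with unit weight (the MM estimator)
V_mm = sum(sample_autocov(f_hat, j) for j in range(-(T - 1), T))
```

Up to rounding error, `V_mm` is the zero matrix, because the OLS normal equations force the sample moments `f_hat.sum(axis=0)` to vanish.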


3.3 Kernel Estimators of $V_T$


Suppose $f_t$ is a stationary process with spectral density given by,


$$S_f(\lambda) = (2\pi)^{-1}\sum_{j=-\infty}^{\infty} \Omega(j) e^{-i\lambda j},$$


where $i = \sqrt{-1}$. Then, as Hansen [1982] pointed out, the limit of
$V_T$ as $T$ approaches infinity is $2\pi$ times the spectral density matrix
of $f_t$ evaluated at the zero frequency. This relationship between
the estimand, $V_T$, and the spectral density matrix of the $f_t$ process
suggests that standard methods for estimating spectral densities
would be appropriate. This observation has motivated much of
the research in this area and most of the results on estimating $V_T$
have direct counterparts in the spectral density literature. Some
differences, however, should be noted. The spectral density literature
typically assumes stationarity and focuses on estimating
over frequency bands. The econometrics literature often allows for
departures from stationarity and focuses on estimating the spectral
density at a particular point (the zero frequency). In addition, the
spectral density literature typically takes the $f_t$ process, perhaps
after a mean correction, as primitive, whereas the econometric literature
pays more attention to the finite sample problems inherent
in using the observable $\hat f_t$ process in place of the unobservable $f_t$
process.


3.3.1 Weighted Autocovariance Estimators 


The most widely used estimators of $V_T$ can be written as a weighted
average of sample autocovariance matrices,


$$\hat V_T = \sum_{s=-(T-1)}^{T-1} w_s \hat\Omega_T(s),$$


where Parzen termed the sequence of weights, $\{w_s\}$, the “lag window”.
These estimators correspond to a class of kernel spectral
density estimators evaluated at frequency zero. This class of estimators
is closely related to the method of moments estimator, $\hat V_{MM}$:
the MM estimator uses the lag window $w_s = 1$ for all $s$. The strategy
to obtain a consistent estimator here is to choose a lag window in a
way that the sequence of weights approaches unity rapidly enough
to obtain asymptotic unbiasedness but slowly enough to ensure that
the variance converges to zero.


We concentrate on a particular class of lag windows, scale
parameter windows, where the lag window can be expressed as
$w_s = k(s/m_T)$. Here $m_T$ simply “stretches” or “contracts” the
distribution and hence acts as a scale parameter. The function
$k(z)$ is referred to as the “lag window generator”.


ASSUMPTION 3.1


The function $k(\cdot): \mathbb{R} \to [-1, 1]$ satisfies $k(0) = 1$, $k(z) = k(-z)$
for all $z \in \mathbb{R}$, $\int_{-\infty}^{\infty} |k(z)|\,dz < \infty$, and $k(\cdot)$ is continuous at
0 and at all but a finite number of other points.


When the value of the kernel is zero for $|z| > 1$, $m_T$ is called the
“lag truncation parameter” because autocovariances corresponding
to lags greater than $m_T$ are given zero weight. The scalar $m_T$ is
often referred to as the “bandwidth parameter” or simply the “scale
parameter”.


Table 3.1 lists the properties of a number of popular kernel es- 
timators. The first and simplest is the “truncated window” studied 
in this context by White [1984]. This kernel weights the autoco- 
variances equally but truncates the distribution at lag mr. Newey 
and West [1987], observing that the truncated window does not 
guarantee a positive semi-definite estimate, suggested the Bartlett 
window as an alternative. The Bartlett-Newey—West window has 
linearly declining weights and provides a positive semi-definite es- 
timate by construction. The Parzen window, suggested by Gallant 
[1987], is a standard choice in the spectral density literature and is 
also positive semi-definite. 
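A minimal sketch of a weighted autocovariance estimator with scale parameter windows follows. The kernel formulas are the standard truncated, Bartlett and Parzen lag window generators; the function names and the loop-based implementation are choices made here for illustration and are not taken from any particular package.

```python
import numpy as np

def truncated(z):
    return 1.0 if abs(z) <= 1.0 else 0.0

def bartlett(z):
    z = abs(z)
    return 1.0 - z if z <= 1.0 else 0.0

def parzen(z):
    z = abs(z)
    if z <= 0.5:
        return 1.0 - 6.0 * z ** 2 + 6.0 * z ** 3
    if z <= 1.0:
        return 2.0 * (1.0 - z) ** 3
    return 0.0

def kernel_estimate(f, m, kernel=bartlett):
    """V_hat = sum_s k(s/m) Omega_hat(s), with f a (T x q) array of moment values."""
    f = np.asarray(f, dtype=float)
    T = f.shape[0]
    V = f.T @ f / T                              # Omega_hat(0)
    for s in range(1, T):
        w = kernel(s / m)
        if w == 0.0:
            continue                             # lag receives zero weight
        omega = f[s:].T @ f[:T - s] / T          # Omega_hat(s)
        V += w * (omega + omega.T)               # adds lags s and -s together
    return V
```

For example, `kernel_estimate(f_hat, m=5)` gives a Bartlett (Newey-West) estimate, while passing `kernel=truncated` reproduces the equally weighted truncated estimator, which need not be positive semi-definite.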


The forms of the last three windows, the Daniell, Quadratic 
Spectrum and Tent, are based on frequency domain considerations, 
to which we now turn. 


3.3.2 Weighted Periodogram Estimators 


The properties of lag windows are most easily revealed by examining
the finite Fourier transform of the lag window,


$$W(\lambda, m_T) = (2\pi)^{-1}\sum_{s=-(T-1)}^{T-1} w_s e^{-i\lambda s}.$$


Parzen termed this function the “spectral window”. Standard kernel
estimators can be expressed as weighted integrals of periodogram
ordinates,


$$\hat V_T = 2\pi\int_{-\pi}^{\pi} W(\lambda, m_T) I_T(\lambda)\,d\lambda,$$
Table 3.1: Spectral Windows


Kernel              k(z)                                              r    k_r      ∫k²(z)dz

Truncated           1 for |z| ≤ 1, 0 otherwise                        ∞    0        2
Bartlett            1 − |z| for |z| ≤ 1, 0 otherwise                  1    1        2/3
Parzen              1 − 6z² + 6|z|³ for 0 ≤ |z| ≤ 1/2,
                    2(1 − |z|)³ for 1/2 ≤ |z| ≤ 1, 0 otherwise        2    6        151/280
Quadratic Spectral  (25/(12π²z²))[sin(6πz/5)/(6πz/5) − cos(6πz/5)]    2    π²/10    6/5
Daniell             sin(πz)/(πz)                                      2    π²/6     1
Tent                2(1 − cos(πz))/(πz)²                              2    π²/12    4


Kernel              W(λ)

Truncated           (1/(2π)) sin[(m_T + 1/2)λ] / sin(λ/2)
Bartlett            (1/(2π m_T)) [sin(m_T λ/2) / sin(λ/2)]²
Parzen              (3/(8π m_T³)) [sin(m_T λ/4) / ((1/2) sin(λ/2))]⁴
Quadratic Spectral  (3 m_T/(4π)) [1 − (m_T λ/π)²] for |λ| ≤ π/m_T, 0 for |λ| > π/m_T
Daniell             m_T/(2π) for |λ| ≤ π/m_T, 0 otherwise
Tent                m_T (1/π − m_T |λ|/π²) for |λ| ≤ π/m_T, 0 otherwise
where $I_T(\lambda)$ is the periodogram or sample spectral density,


$$I_T(\lambda) = (2\pi)^{-1}\sum_{s=-(T-1)}^{T-1} \hat\Omega_T(s) e^{-i\lambda s},$$


and $W(\lambda, m_T)$ is the averaging kernel. Because $W(\lambda, m_T)$ is typically
concentrated about zero, we may think of $\hat V_T$ as a ‘view’ of the
periodogram through a narrow ‘window.’ With these preliminaries,
we may discuss the three spectral windows provided in Table 3.1.


The simplest spectral window is the “rectangular” window proposed
by Daniell [1946]. The Daniell estimator is a simple average
of the periodogram ordinates over a band of width $2\pi/m_T$, and thus
the Daniell window can be thought of as the dual of the truncated
window. The ‘truncated’ estimator weights autocovariances equally
up to some truncation point, whereas the Daniell estimator weights
the periodogram ordinates equally up to some truncation point.


The tent window, which (along with the rectangular) has been
advocated by Doan and Litterman [1995], assigns linearly declining
weights to periodogram ordinates in a band of $\pm\pi/m_T$ about zero.
The tent window can be thought of as the dual to the Bartlett window.
The ‘tent’ assigns linearly declining weights to periodogram
ordinates, whereas the Bartlett assigns linearly declining weights
to the autocovariances.


The final spectral window considered here, the Quadratic Spectral
window, has a more complex form. This estimator was originally
derived by Priestley [1981], as the kernel which minimizes the
average mean squared error. Andrews [1991] shows that this form
also minimizes the mean squared error at the zero frequency.


3.3.3 Computational Considerations 


At one time, the spectral approach was considered computationally 
burdensome. After all, the three ‘time-domain’ kernels in Table 
3.1 require only that the first mr autocovariances be computed 
whereas the three spectral estimators require all T autocovariances. 
Widespread use of the fast Fourier transform (FFT) algorithm 
for computing finite Fourier transforms has largely changed this 
assessment. 
Define the finite (or discrete) Fourier transform of $\hat f_t$ as,


$$\zeta(\lambda_p) = (2\pi T)^{-1/2}\sum_{t=1}^{T} \hat f_t e^{-i\lambda_p t}.$$


If $T$ is highly composite (i.e., can be factored into small primes) the
fast Fourier transform algorithm permits rapid digital calculation
of the above expression. The periodogram matrix can then be
computed at the Fourier frequencies $\lambda_p = 2\pi p/T$, $p = 1, \ldots, T/2$,
as the conjugate square of the finite Fourier transform,


$$I_T(\lambda_p) = \zeta(\lambda_p)\zeta(\lambda_p)^*.$$


A good approximation to the integrals above can then be obtained
from


$$\hat V_T \approx \frac{4\pi^2}{T}\sum_{p=-[T/2]}^{[T/2]} I_T(\lambda_p) W(\lambda_p, m_T).$$


This approximation turns out to be exact for scale parameter families
with truncation parameter $m_T < (T + 1)/2$.


The approximation can be avoided, and exact estimates can
be obtained from the spectral approach by summing over $P \ge 2T - 1$
periodogram ordinates. The exact spectral estimator can
be obtained from,


$$\hat V_T = \frac{4\pi^2}{2T-1}\sum_{p=-(T-1)}^{T-1} I_T(\lambda_p) W(\lambda_p, m_T),$$


where $\lambda_p = 2\pi p/(2T - 1)$. The periodogram at these frequencies
can be obtained by ‘padding’ $\{\hat f_t\}$ with zeros at either end and
proceeding as above.
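The role of zero padding can be illustrated for a scalar series: transforming the padded series, squaring, and transforming back recovers every sample autocovariance without wrap-around error. This is a sketch using NumPy's FFT routines; the function names are hypothetical, and the zeros are appended at one end only, which is an implementation convenience.

```python
import numpy as np

def autocov_fft(f):
    """Sample autocovariances Omega_hat(s), s = 0..T-1, via a zero-padded FFT."""
    f = np.asarray(f, dtype=float)
    T = f.size
    P = 2 * T - 1                          # enough ordinates to avoid wrap-around
    z = np.fft.fft(f, n=P)                 # fft pads f with zeros up to length P
    acov = np.fft.ifft(z * np.conj(z)).real / T
    return acov[:T]

def autocov_direct(f):
    """The same autocovariances from the time-domain definition."""
    f = np.asarray(f, dtype=float)
    T = f.size
    return np.array([f[s:] @ f[:T - s] / T for s in range(T)])
```

Both functions return the same values up to rounding error, so any lag window can be applied after a single pair of transforms instead of computing each autocovariance separately.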


3.3.4 Spectral Interpretation 


Apart from any computational advantages (or disadvantages) the 
spectral representation of kernel estimators provides a number of 
immediate insights into the estimation problem. First, because the 
periodogram matrix is known to be positive semi-definite by construction,
a necessary and sufficient condition to guarantee a positive
semi-definite estimate of $V_T$ is that $W(\lambda, m_T)$ be everywhere
non-negative. It is clear that this condition is satisfied for all of the
kernels listed in Table 3.1, apart from the truncated estimator. The
truncated estimator has negative ‘side-lobes’ (i.e., gives negative
weight to periodogram ordinates removed from the zero frequency) 
and hence, depending on the shape of the periodogram, need not 
be positive semi-definite. In particular, the estimate may fail to be 
positive semi-definite if the periodogram has a relative minimum 
in the neighborhood of the zero frequency. 


The spectral approach clarifies the nature of the estimation 
problem. For stationary processes, the periodogram matrix at a 
particular frequency is well known to be an asymptotically unbi- 
ased estimator of the spectral density at that frequency. However, 
because the variance of periodogram ordinates does not approach 
zero, the periodogram is not a consistent estimator. To achieve 
consistency, we average periodogram ordinates in some band about 
the zero frequency. Because the elements of the periodogram matrix 
evaluated at the Fourier frequencies, $\lambda_p = 2\pi p/T$, $p = 1, \ldots, T/2$,
are asymptotically independent, this averaging process dampens 
the variance. To achieve consistency, a) the number of independent 
frequencies must grow large, b) the underlying spectral density 
must be sufficiently smooth in the neighborhood of zero and c) 
the width of the frequency band must shrink. 


In finite samples, the choice of a bandwidth confronts us with 
a trade-off between bias reduction and variance reduction. If we 
choose a wide bandwidth (choose a small value for $m_T$) our estimator
averages a large number of periodogram ordinates and hence has
low variance. Because this estimator uses frequencies far removed 
from zero, however, it may be badly biased if the true spectral 
density is not constant across frequencies. If we choose a narrow 
bandwidth (choose a large value for $m_T$) the bias falls but, because
we employ fewer independent periodogram ordinates, the variance 
of the estimator grows large. 


3.3.5 Asymptotic Properties of Scale Parameter Estimators


We now establish the asymptotic behavior of scale parameter window
estimators of $V_T$. We establish the order of convergence of
these estimators and their asymptotic bias, variance and mean
squared error. The results of this section rely heavily on Andrews’
[1991] treatment of the problem.


As is evident from the discussion above, the behavior of window
estimators will depend on the smoothness of the underlying spectral
density and on the bandwidth of the averaging kernel. Let $r$ be
the largest integer such that


$$k_r = \lim_{z \to 0} \frac{1 - k(z)}{|z|^r}$$


is finite and non-zero. Parzen calls $r$ the “characteristic exponent”
of the function $k(z)$. If $k_r$ is finite for some $r_0$ then it is zero for
$r < r_0$. The quantity $k_r$ may be thought of as a measure of the
smoothness of the lag window and a measure of the bandwidth of
the spectral window. Table 3.1 presents values of $r$ and $k_r$ for the
six representative kernels.
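The limit that defines $k_r$ can be checked numerically by evaluating the ratio at a small $z$. The sketch below does this for the Bartlett and Parzen lag window generators; the evaluation point `z = 1e-4` and the function names are arbitrary choices for the illustration.

```python
def bartlett(z):
    z = abs(z)
    return 1.0 - z if z <= 1.0 else 0.0

def parzen(z):
    z = abs(z)
    if z <= 0.5:
        return 1.0 - 6.0 * z ** 2 + 6.0 * z ** 3
    if z <= 1.0:
        return 2.0 * (1.0 - z) ** 3
    return 0.0

def k_r(kernel, r, z=1e-4):
    """Finite-z approximation to lim_{z -> 0} (1 - k(z)) / |z|^r."""
    return (1.0 - kernel(z)) / abs(z) ** r

bartlett_k1 = k_r(bartlett, r=1)   # close to 1
parzen_k2 = k_r(parzen, r=2)       # close to 6
parzen_k1 = k_r(parzen, r=1)       # close to 0, since 1 is below Parzen's exponent
```

Raising $r$ above the characteristic exponent makes the ratio blow up as $z$ shrinks, which is how the largest admissible $r$ reveals itself numerically.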


A useful measure of smoothness of the spectral density function
in the neighborhood of zero is $S^{(r)}$, defined as


$$S^{(r)} = (2\pi)^{-1}\sum_{j=-\infty}^{\infty} |j|^r \Omega(j).$$


Parzen terms this measure the “generalized $r$th derivative” of $S_f(\lambda)$.
For $r = 2$, this is simply the second (ordinary) derivative of $S_f(\lambda)$
with respect to $\lambda$ evaluated at $\lambda = 0$.


To reduce the problem to one of scalars we consider a linear
combination of the elements of $\hat V_T$ given by
$\mathrm{vec}(\hat V_T - V_T)' B_T \mathrm{vec}(\hat V_T - V_T)$, where $B_T$ is a $q^2 \times q^2$
(possibly) random weighting matrix.
Because occasionally we encounter GMM parameter estimators
that do not possess second moments, we establish the asymptotic
truncated mean squared error of our estimators.


DEFINITION 3.1 Truncated Mean Squared Error


$$MSE_h[(T/m_T), \hat V_T, B_T] = E \min\left\{ \left| \frac{T}{m_T}\,\mathrm{vec}(\hat V_T - V_T)' B_T \mathrm{vec}(\hat V_T - V_T) \right|,\; h \right\}.$$


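Following Andrews [1991] we employ the following additional
assumptions on the $\{f_t\}$ process.

(i) $\{f_t\}$ is a mean zero, eighth order stationary sequence
of rv’s with $\sum_{j=-\infty}^{\infty} \|\Omega(j)\| < \infty$ and


$$\sum_{j_1=-\infty}^{\infty} \cdots \sum_{j_7=-\infty}^{\infty} |\kappa_{a_1 \ldots a_8}(0, j_1, \ldots, j_7)| < \infty,$$


where $\kappa_{a_1 \ldots a_8}$ is the eighth order cumulant of $(f_{a_1,t}, \ldots, f_{a_8,t+j_7})$
and $f_{i,t}$ is the $i$th element of $f_t$.

(ii) $\mathrm{vec}\left(\frac{\partial}{\partial\theta'} f_t - E\frac{\partial}{\partial\theta'} f_t\right)$ is a mean zero, fourth order
stationary sequence with


$$\sum_{j=-\infty}^{\infty}\sum_{m=-\infty}^{\infty}\sum_{n=-\infty}^{\infty} |\kappa_{abcd}(0, j, m, n)| < \infty \quad \forall a, b, c, d.$$


(iii) $\sqrt{T}(\hat\theta_T - \theta_0) = O_p(1)$.

(iv) $\sup_{t \ge 1} E\|f_t\|^2 < \infty$.

(v) $\sup_{t \ge 1} E \sup_{\theta \in \Theta} \|F_t\|^2 < \infty$.

(vi) $\sup_{t \ge 1} E \sup_{\theta \in \Theta} \|(\partial/\partial\theta') F_t\|^2 < \infty$.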

The cumulant conditions, (i) and (ii), are standard in the spectral 
density literature. Andrews shows that they are also implied by 
a strong mixing condition and a moment condition. Part (iii) is 
implied by asymptotic normality and parts (iv), (v) and (vi) are 
commonly used to obtain asymptotic normality of GMM estimates. 


THEOREM 3.1


Suppose $m_T \to \infty$ and $B_T \xrightarrow{p} B$. Then we have: (i) If
$m_T^2/T \to 0$, then $\hat V_T - V_T \xrightarrow{p} 0$. (ii) If $m_T^{2r+1}/T \to \gamma \in (0, \infty)$
for some $r \in [0, \infty)$ for which $k_r, \|S^{(r)}\| < \infty$,
then $\sqrt{T/m_T}(\hat V_T - V_T) = O_p(1)$. (iii) Under the
conditions of part (ii),


$$\lim_{h \to \infty}\lim_{T \to \infty} MSE_h((T/m_T), \hat V_T, B_T) = 4\pi^2\left( k_r^2 (\mathrm{vec}\,S^{(r)})' B\,\mathrm{vec}\,S^{(r)}/\gamma + \int k^2(z)\,dz\; \mathrm{tr}\,B(I + B_{qq})\,S_f(0) \otimes S_f(0) \right),$$


where $B_{qq}$ is defined as $\sum_{i=1}^{q}\sum_{j=1}^{q} e_i e_j' \otimes e_j e_i'$ and $e_i$
is the $q$ element zero vector with unity as the $i$th
element.


Part (i) establishes the consistency of scale parameter covariance
estimators for bandwidth parameter sequences that grow at
rate $o(T^{1/2})$, part (ii) gives the rate of convergence for these
estimators and part (iii) gives the asymptotic truncated MSE of scale
parameter covariance estimators.


Theorem 3.1 was proved by Andrews [1991] under the assump- 
tions given here. Newey and West, Andrews and others show, how- 
ever, that under certain regularity conditions, Theorem 3.1 holds 
even if f; is a nonstationary process without a well-defined spectral 
density. Theorem 3.1 establishes a class of consistent estimators of 
$V_T$ which can be used to construct asymptotically valid GMM test
statistics and to calculate an asymptotically efficient estimator in
over-identified systems.


Theorem 3.1 also can be used to establish the asymptotic mean
and variance of individual elements of $\hat V_T$. The $j$th diagonal element
of $\hat V_T$ has asymptotic bias $-m_T^{-r} k_r 2\pi S_{jj}^{(r)}$ and asymptotic variance


$$(m_T/T)\,2V_{jj}^2\int k^2(z)\,dz.$$


EXAMPLE 3.4


Consider the $p = 1$ case so that $V_T$, $\hat V_T$ and $S$ are scalars.
Suppose further that $\{f_t\}$ is homoskedastic and follows
an AR(1) with coefficient $\rho = .80$ and variance $\sigma_f^2 = 1$.
Then our estimand $V_T = 2\pi S_f(0) = \sigma_f^2(1 + \rho)/(1 - \rho) = 9.0$.
We also have that $2\pi S^{(1)} = 2\rho/(1 - \rho)^2$ and
$2\pi S^{(2)} = 2\rho(1 + \rho)/(1 - \rho)^3$. (i) For
a Newey-West-Bartlett window we have $r = 1$, $k_1 = 1$
and $\int k^2 = 2/3$. The asymptotic bias is


$$-m_T^{-1}\,2\rho/(1 - \rho)^2 = -40 m_T^{-1},$$


the asymptotic variance is


$$(m_T/T)(4/3)[(1 + \rho)/(1 - \rho)]^2 = 108(m_T/T),$$


and the asymptotic mean squared error is


$$1600 m_T^{-2} + 108(m_T/T).$$
(ii) For a Tent window we have $r = 2$, $k = 4$ and
$\int k^2 = 4$. The asymptotic bias is


zaz = E 
-mr (75) + p)(1 - p)* = -15m7°, 


the asymptotic variance is


$$(m_T/T)(8)[(1 + \rho)/(1 - \rho)]^2 = 648(m_T/T),$$


and the asymptotic mean squared error is


$$225 m_T^{-4} + 648(m_T/T).$$
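The arithmetic in part (i) of the example can be reproduced directly. The sketch below assumes the Bartlett constants $k_1 = 1$ and $\int k^2 = 2/3$ and a scalar AR(1) with $\rho = 0.8$ and unit variance; the function name and the illustrative values $m_T = 10$, $T = 1000$ are not from the text.

```python
rho = 0.8
V = (1 + rho) / (1 - rho)            # estimand V_T = 9
s1 = 2 * rho / (1 - rho) ** 2        # 2*pi*S^(1) = 40
k1, int_k2 = 1.0, 2.0 / 3.0          # Bartlett: k_1 and integral of k^2

def bartlett_amse(m, T):
    """Asymptotic bias, variance and MSE of the Bartlett estimate for bandwidth m."""
    bias = -k1 * s1 / m                        # -40 / m
    var = 2.0 * int_k2 * V ** 2 * m / T        # 108 * m / T
    return bias, var, bias ** 2 + var

bias, var, mse = bartlett_amse(m=10, T=1000)
```

With these illustrative values the squared bias (16) dominates the variance (1.08), which already hints that a larger bandwidth would lower the mean squared error for a sample of this size.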


3.4 Optimal Choice of Covariance 
Estimator 


Theorem 3.1 establishes the consistency of a wide class of kernel 
estimators. The result, however, provides no guidance on which 
particular kernel estimator to choose, nor how to choose the band- 
width parameter. Further, kernel estimators are only one of a num- 
ber of plausible estimation strategies. In this section we attempt 
to provide some guidance for the choice of estimator. 


3.4.1 Optimal Choice of Scale Parameter Window 


One generally accepted criterion for discriminating among covari- 
ance matrix estimators is that the estimator should always deliver 
positive semi-definite estimates. Negative variance estimates are 
clearly not sensible. More importantly, many iterative techniques 
for calculating the optimal GMM estimator require a positive definite
$\hat V_T$ in each iteration. Restricting attention to positive
semi-definite estimators eliminates the truncated window, but leaves a
large class from which to choose. 


The asymptotic results of Theorem 3.1 suggest that, all else 
equal, we should prefer estimators with a large r. The variance 
of these kernel estimators is of order $m_T/T$ and the bias is of order
$m_T^{-r}$. Asymptotically then, estimators with larger $r$ will tend to
dominate according to a MSE criterion. On the other hand, no
kernel estimator with r > 2 can be positive semi-definite. (See 
Priestley [1981, p. 568.]) This suggests that we restrict attention 
to estimators with r = 2 which rules out the Bartlett and truncated 
kernels. 


Within the class of scale parameter windows with r = 2, 
Priestley shows that the Quadratic Spectral window minimizes the 
maximum relative mean square error across the spectral density. 
Andrews [1991] shows that the Quadratic Spectral window also
minimizes the truncated mean square error at the zero frequency. 
Thus, according to the MSE criterion, the Quadratic Spectral win- 
dow appears to be a good choice. 


It should be noted, however, that the optimality of the spec- 
tral window is not as clear-cut as the above results appear. The 
asymptotic optimality need not obtain in finite samples and, as 
we discuss below, the MSE criterion is not terribly compelling in 
this context. Newey and West’s [1994] Monte Carlo work finds 
only minor differences between the performance of the Quadratic 
Spectral window and their preferred estimator. The choice of a 
scale parameter turns out to be of much greater importance, in 
practice, than the form of the window. 


3.4.2 Optimal Choice of Scale Parameter 


Theorem 3.1 tells us that, from an asymptotic MSE criterion, the 
scale parameter should be $O(T^{1/(2r+1)})$. This suggests that a form,
$m_T = \gamma T^{1/(2r+1)}$, may be appropriate. Hannan [1970] notes that this
suggestion is ‘obviously’ useless because we do not know the value of
$\gamma$. Unfortunately, statistical inference is often very sensitive to the
choice of $m_T$. Some method of selecting $m_T$ is clearly desirable, but
ultimately, the appropriate choice requires some prior knowledge or
additional restrictions on the underlying process, $\{f_t\}$.


Several ‘judgmental’ approaches for selecting the scale parameter
have been suggested. For kernel estimators of the truncated
sort (the truncated, Bartlett and Parzen), Priestley suggests that a visual
examination of the autocovariance function may be helpful. The
idea is that because $m_T$ is the highest order autocorrelation considered,
we should use for $m_T$ the value such that $\Omega(s) \approx 0$ for
$|s| > m_T$. This approach has a number of obvious drawbacks.
One problem is that the number of moment conditions, $q$, may be
large. Visually inspecting the $q$ autocorrelation functions at each
iteration of each estimation does not appear to be practical. In
addition, the approach is only consistent for the truncated kernel.
For the Bartlett and Parzen, $m_T$ must grow large, even if the true
autocorrelation function is truncated at some known, finite point.


Andrews [1991] suggests an approach that avoids a purely
judgmental strategy. Suppose we are interested in minimizing the
MSE of some linear combination of the elements of $\hat V_T$, $B\,\mathrm{vec}(\hat V_T)$,
where $B$ is a $q^2 \times q^2$ non-stochastic matrix. The optimal scale
parameter can be obtained by letting $m_T$ grow at the optimal rate
and then differentiating the result of Theorem 3.1 with respect to
$\gamma$ and setting the result equal to zero. This yields,


$$m_T^* = \left( r k_r^2 \alpha(r) T \Big/ \int k^2(z)\,dz \right)^{1/(2r+1)},$$


where


$$\alpha(r) = \frac{2(\mathrm{vec}\,S^{(r)})' B\,\mathrm{vec}\,S^{(r)}}{\mathrm{tr}\,B(I + B_{qq})\,S_f(0) \otimes S_f(0)},$$


and $B_{qq}$ is the $q^2 \times q^2$ commutation matrix.
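For a scalar estimand the formula is simple to evaluate, since $\alpha(r)$ reduces to $(S^{(r)}/S_f(0))^2$. The sketch below applies it to an AR(1) process with $\rho = 0.8$ and a Bartlett window ($r = 1$, $k_1 = 1$, $\int k^2 = 2/3$); the sample size $T = 1000$ and the function names are assumptions for the illustration.

```python
def optimal_bandwidth(r, k_r, int_k2, alpha, T):
    """m_T* = (r k_r^2 alpha(r) T / integral of k^2)^(1/(2r+1))."""
    return (r * k_r ** 2 * alpha * T / int_k2) ** (1.0 / (2 * r + 1))

def asymptotic_mse(m, T, r, k_r, int_k2, s_r, V):
    """Squared bias plus variance in the scalar case (s_r = 2*pi*S^(r), V = 2*pi*S_f(0))."""
    return (k_r * s_r / m ** r) ** 2 + 2.0 * int_k2 * V ** 2 * m / T

rho, T = 0.8, 1000
V = (1 + rho) / (1 - rho)             # 2*pi*S_f(0) = 9
s1 = 2 * rho / (1 - rho) ** 2         # 2*pi*S^(1) = 40
alpha1 = (s1 / V) ** 2                # scalar alpha(1); the 2*pi factors cancel
m_star = optimal_bandwidth(r=1, k_r=1.0, int_k2=2.0 / 3.0, alpha=alpha1, T=T)
```

Here `m_star` is roughly 31, and evaluating `asymptotic_mse` slightly above or below it gives a larger value, consistent with the first-order condition behind the formula.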


The problem, of course, is that $S_f(0)$ and $S^{(r)}$ are unknown.
Andrews’ suggestion is to estimate $S_f(0)$ and $S^{(r)}$ using a simple
parametric model, say a low order VAR, and then to use the estimated
values to obtain an estimate of the optimal $m_T$. He shows
that under suitable regularity conditions, this two-step procedure 
is asymptotically efficient, even if the parametric model is misspec- 
ified. 


Newey and West [1994] suggest a similar approach, but they
suggest a nonparametric technique for estimating $S$ and $S^{(r)}$. They
point out that $S$ and $S^{(r)}$ can be estimated using a kernel estimator
and they propose the simple, truncated estimator for this first-round
estimation. Of course, their suggestion requires choosing
some scale parameter, say ‘n’, to obtain the first-round estimates.
They suggest exercising some ‘judgement’ on the sensitivity of results
to the choice of n by increasing or decreasing the parameter
and they conjecture that statistical inference will be less sensitive
to the choice of ‘n’ than to the choice of $m_T$.


The great advantage of the approaches suggested by Andrews
and by Newey and West is that they remove some of the judgmental
aspects of selecting the scale parameter. Some method of preventing
researchers from selecting the value of $m_T$ that best supports
their case is clearly desirable.


The optimality of the Andrews and the Newey—West proce- 
dures, however, should not be overemphasized. The choice of a 
low-order parametric model (Andrews) and the choice of a partic- 
ular first-round scale parameter (Newey—West) implicitly imposes 
strong priors on the smoothness of the underlying spectral density. 
In addition, the MSE criterion for selecting $m_T$, though standard in
the spectral density estimation literature, is not terribly compelling 
in the present context. After all, we are rarely interested in esti- 
mates of the covariance matrix per se. Instead, we are interested 
in the behavior of test statistics that use this estimated covariance 
matrix. It is by no means clear that the one-for-one trade-off 
between squared bias and variance, implicit in the MSE criterion, is 
appropriate. Indeed, Monte Carlo studies suggest that the behavior 
of t-statistics is far more unfavorably affected by bias than by 
variance in estimated covariance matrices. 


3.4.3 Other Estimation Strategies 


The estimators we have reviewed so far are weighted averages
of the autocovariances or, equivalently, weighted averages of the
periodogram ordinates. As such, they are quadratic functions of
the underlying observations. Quadratic estimators represent but a
small class of possible estimation strategies.


den Haan and Levin [1996] advocate applying a purely parametric
approach to estimating $V_T$. Their suggestion, in essence, is
simply to use the estimates derived from Andrews’ [1991] first-round
parametric estimation. That is, they suggest estimating
$S(0)$ and hence $V_T$ directly from a low order VAR or a sequence of
univariate ARMA models. This approach can be made “automatic”
by fitting a VAR and choosing the order according to some lag
length selection criterion. In the spectral literature, this approach
has been advocated by Akaike [1969] and Parzen [1974].


The relationship between the parametric approach and the
kernel estimators is best understood by noting that fitting ARMA
models in the time domain can be interpreted as fitting a rational
polynomial in $e^{-i\lambda}$ to the periodogram ordinates in the
frequency domain, see Hannan [1969]. Thus the parametric approach
involves globally smoothing the periodogram ordinates using a
rational polynomial whereas the kernel or window approach involves
local smoothing with linear filters. Global smoothing by rational
polynomials relies on the prior belief that the underlying spectral
density has a particular shape, so that high frequency components
of the periodogram provide information about the value at the zero
frequency.
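

As a rough illustration of the purely parametric route, the sketch
below treats the scalar case only. It fits autoregressions of
increasing order by least squares, picks the order by AIC, and
evaluates the implied estimate of $2\pi$ times the AR spectral
density at frequency zero, $\sigma_\varepsilon^2/(1 - \sum_j \phi_j)^2$.
The function name, the use of AIC as the selection criterion, and the
cap `max_order` are assumptions made for illustration and are not
taken from den Haan and Levin.

```python
import numpy as np


def ar_long_run_variance(f, max_order=4):
    """Parametric (AR) estimate of the long-run variance of a scalar series.

    Hypothetical helper: the order is chosen by AIC over 0..max_order,
    using a common estimation sample so the criteria are comparable.
    """
    f = np.asarray(f, dtype=float)
    T = f.size
    y = f[max_order:]
    best = None
    for k in range(max_order + 1):
        if k == 0:
            phi = np.zeros(0)
            resid = y
        else:
            # Column j holds the j-th lag of f aligned with y.
            Z = np.column_stack([f[max_order - j:T - j] for j in range(1, k + 1)])
            phi, *_ = np.linalg.lstsq(Z, y, rcond=None)
            resid = y - Z @ phi
        sigma2 = resid @ resid / y.size
        aic = np.log(sigma2) + 2.0 * k / y.size
        if best is None or aic < best[0]:
            best = (aic, k, phi, sigma2)
    _, k, phi, sigma2 = best
    # AR spectral density at frequency zero, times 2*pi.
    return sigma2 / (1.0 - phi.sum()) ** 2, k
```

The estimate depends on the whole fitted AR polynomial, which is the
sense in which the parametric approach smooths globally: a poor fit at
high frequencies feeds directly into the value at frequency zero.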


The ‘prewhitening technique’ developed by Press and Tukey
[1956] and advocated in the present context by Andrews and Monahan
[1992] is an appealing compromise between parametric and
nonparametric approaches. The first step calls for fitting a
parametric model as above. For an n-th order VAR, this step calls for
estimating the model,


$$ f_t = \Phi_1 f_{t-1} + \Phi_2 f_{t-2} + \cdots + \Phi_n f_{t-n} + \varepsilon_t . $$


Because the VAR is only an approximation, the residuals from
this regression, $\hat\varepsilon_t$, need not be white noise. The
next step is to estimate $S_\varepsilon$, the spectral density matrix
of the residual process at the zero frequency. This may be
accomplished by any of the kernel estimation approaches discussed
above. The benefit of this approach is that we may apply a fairly
wide bandwidth kernel estimator. This wide bandwidth estimator has
low variance and, because the prewhitening technique has
approximately flattened the spectral density, may have small bias.
The final estimate can then be obtained by ‘recoloring’ using the
estimated coefficients from the VAR,

$$ \hat V_{PW} = [I_q - \hat\Phi_1 - \hat\Phi_2 - \cdots - \hat\Phi_n]^{-1}\, \hat S_\varepsilon\, [I_q - \hat\Phi_1 - \hat\Phi_2 - \cdots - \hat\Phi_n]^{-1\prime} . $$
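

The prewhiten, smooth, recolor sequence can be sketched for a scalar
series with an AR(1) filter. This is a minimal illustration under
stated assumptions rather than the Andrews and Monahan procedure
itself: the function names are hypothetical, the Bartlett weights are
taken to be $1 - j/S$ for lags $j < S$, the series is treated as mean
zero, and the cap `max_abs_phi` on the fitted AR coefficient is an
illustrative safeguard against a near unit root in the recoloring
step.

```python
import numpy as np


def bartlett_lrv(e, bandwidth):
    """Bartlett-kernel long-run variance of a scalar, mean-zero series."""
    e = np.asarray(e, dtype=float)
    T = e.size
    lrv = e @ e / T
    for j in range(1, int(np.ceil(bandwidth))):
        weight = 1.0 - j / bandwidth
        lrv += 2.0 * weight * (e[j:] @ e[:-j]) / T
    return lrv


def prewhitened_lrv(f, bandwidth, max_abs_phi=0.97):
    """AR(1) prewhitening, kernel smoothing of the residuals, recoloring."""
    f = np.asarray(f, dtype=float)
    y, ylag = f[1:], f[:-1]
    # Step 1: fit the AR(1) filter by least squares (no intercept).
    phi = float(ylag @ y) / float(ylag @ ylag)
    phi = min(max(phi, -max_abs_phi), max_abs_phi)
    # Step 2: kernel estimate for the (approximately white) residuals.
    resid = y - phi * ylag
    lrv_resid = bartlett_lrv(resid, bandwidth)
    # Step 3: recolor with the estimated filter.
    return lrv_resid / (1.0 - phi) ** 2, phi
```

Without the cap, a fitted coefficient close to one makes the divisor
$(1 - \hat\phi)^2$ close to zero, so the recolored estimate can become
arbitrarily large.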


The Monte Carlo results in Andrews and Monahan [1992] suggest
that this prewhitening estimator is robust to different underlying
error structures and generally outperforms standard kernel
estimators.


3.5 Finite Sample Properties of HAC Estimators


In this section we examine the finite sample properties of HAC 
estimators in scalar linear regression models. The appeal of us- 
ing these estimators in the construction of OLS test statistics is, 
of course, their robustness to heteroskedastic and autocorrelated 
errors of unknown form. Unfortunately, although inference based 
on the resulting test statistics is asymptotically valid, Monte Carlo 
evidence suggests that the actual size of the test statistic is often far 
above its nominal size. We present some Monte Carlo results which 
show the poor small sample performance of HAC estimators and 
then present some small sample corrections which can substantially 
improve this performance. 


3.5.1 Monte Carlo Evidence 


We conduct seven experiments to give the reader a flavor of the
finite sample properties of HAC estimators used in linear regression
models. The population models are based on those used by
Andrews [1991], where the population regression in each case
contains a constant and three explanatory variables with a mean zero
error term that is distributed independently of the regressors. The
vector of explanatory variables is a mean zero AR(1) normally
distributed process with identity covariance matrix. In six of the
seven experiments, the error is also an AR(1): in three of the six
cases, its variance is one, and in the other three cases, the AR(1)
unit variance process is multiplied by the absolute value of the
second explanatory variable so that the error is heteroskedastic. We
follow Andrews and use the same values of $\rho$ (the AR parameter
equals 0, .5, and .9) for the error and explanatory variables so that,
in the homoskedastic error case, the process $\{x_t u_t\}$ is an
AR(1) with parameter $\rho^2$. The seventh experiment generates an
AR(2) unit variance error process with AR parameters 1 and -.8. In
this population the AR parameter for the explanatory variables is .9
so that the spectral density of $\{x_t u_t\}$ is not concentrated at
the zero frequency. In each experiment we calculate the covariance
estimate of the OLS coefficient estimator and the z-statistic of the
coefficient on $x_2$. The tables report the mean and MSE of the
variance point estimate (scaled by the sample size) as well as the
confidence levels of the test statistic under the true null that the
coefficient on $x_2$ is zero. We use the standard White
heteroskedastic consistent covariance estimator (White), the
Newey-West estimator (NW) with the bandwidth 3.62
($= 4(T/100)^{2/9}$), NW with Andrews’ automatic bandwidth (NW-A),
and the prewhitened estimator based on a VAR(1) using a
quadratic spectral estimator with Andrews’ bandwidth (PW-Q-A).


The results in Tables 3.2-3.8 show that the true confidence
levels associated with the test statistics are almost always lower 
than their nominal levels. The difference between the nominal and 
actual levels increases as the degree of positive autocorrelation in 
the error process increases, with the actual confidence levels being 
lowest when the error is heteroskedastic as well as highly positively 
autocorrelated. 
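

A small-scale version of such an experiment can be sketched as
follows. It is an illustration only and does not reproduce the design
behind the tables: all true coefficients are set to zero, the AR(1)
innovations are scaled so that each series has unit variance, the
Bartlett weights are taken to be $1 - j/S$ with $S = 4(T/100)^{2/9}$,
the interval uses the normal critical value 1.96, and every function
name is hypothetical.

```python
import numpy as np


def ar1(T, rho, k, rng):
    """k independent unit-variance Gaussian AR(1) series of length T."""
    z = np.empty((T, k))
    z[0] = rng.standard_normal(k)
    for t in range(1, T):
        z[t] = rho * z[t - 1] + np.sqrt(1.0 - rho**2) * rng.standard_normal(k)
    return z


def newey_west_V(X, u, bandwidth):
    """Bartlett-kernel estimate of V from the products x_t * u_t."""
    T = X.shape[0]
    f = X * u[:, None]
    V = f.T @ f / T
    for j in range(1, int(np.ceil(bandwidth))):
        G = f[j:].T @ f[:-j] / T
        V += (1.0 - j / bandwidth) * (G + G.T)
    return V


def coverage(rho, T=64, reps=500, seed=0):
    """Share of nominal 95% intervals for the x_2 coefficient covering zero."""
    rng = np.random.default_rng(seed)
    bandwidth = 4.0 * (T / 100.0) ** (2.0 / 9.0)
    hits = 0
    for _ in range(reps):
        X = np.column_stack([np.ones(T), ar1(T, rho, 3, rng)])
        y = ar1(T, rho, 1, rng)[:, 0]  # y_t = u_t: all coefficients are zero
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ (X.T @ y)
        u_hat = y - X @ b
        # Covariance of the OLS estimator: T (X'X)^{-1} V (X'X)^{-1}.
        cov = T * XtX_inv @ newey_west_V(X, u_hat, bandwidth) @ XtX_inv
        hits += abs(b[2]) / np.sqrt(cov[2, 2]) < 1.96
    return hits / reps
```

Comparing `coverage(0.0)` with `coverage(0.9)` gives a quick check of
how the gap between nominal and actual confidence levels widens with
the degree of positive autocorrelation.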


Table 3.2: Monte Carlo Results With T = 64
Homoskedastic Errors, $\rho = 0$; $T\,V(\hat\beta_{2T}) = 1.078$
($y_t = \beta_0 + \sum_{i=1}^{3} \beta_i x_{it} + u_t$, $u_t = \rho u_{t-1} + \varepsilon_t$, $x_{it} = \rho x_{i,t-1} + \eta_{it}$)

           Confidence levels
           .90     .95     .99     $E(T\hat V(\hat\beta_{2T}))$   $MSE(T\hat V(\hat\beta_{2T}))$

White      .867    .925    .978    .972                           .669
NW         .853    .911    .969    .927                           .167
NW-A       .863    .921    .976    .963                           .134
PW-Q-A     .859    .918    .972    .975                           .186


Because our ultimate goal is to conduct (reliable) inference on
$\beta$, we would like to find a way to adjust the estimated OLS test
statistic for finite sample bias in $\hat V_T$ or to calculate a
finite sample size correction. In the special case of iid normally
distributed errors, the finite sample bias in $\hat V_T$ is removed by
the correction factor $T/(T - p)$ and the exact finite sample
distribution of the test statistic is known to be F(p, T - p).
Correcting for finite sample bias or finding the exact finite sample
distribution of the OLS test statistic is more difficult, however,
under more general assumptions on the error term. For this reason, a
more common approach has been to adjust $\hat V_T$ for small sample
bias and hope that the OLS test statistic calculated with this
bias-adjusted estimate will more closely approximate its asymptotic
distribution.


Table 3.3: Monte Carlo Results With T = 64
Homoskedastic Errors, $\rho = .5$; $T\,V(\hat\beta_{2T}) = 1.722$
($y_t = \beta_0 + \sum_{i=1}^{3} \beta_i x_{it} + u_t$, $u_t = \rho u_{t-1} + \varepsilon_t$, $x_{it} = \rho x_{i,t-1} + \eta_{it}$)

           Confidence levels
           .90     .95     .99     $E(T\hat V(\hat\beta_{2T}))$   $MSE(T\hat V(\hat\beta_{2T}))$

White      .761    .838    .932    .960                           1.377
NW         .796    .867    .945    1.171                          .653
NW-A       .796    .867    .945    1.174                          .671
PW-Q-A     .838    .897    .962    1.493                          .847


Table 3.4: Monte Carlo Results With T = 64
Homoskedastic Errors, $\rho = .9$; $T\,V(\hat\beta_{2T}) = 7.1136$
($y_t = \beta_0 + \sum_{i=1}^{3} \beta_i x_{it} + u_t$, $u_t = \rho u_{t-1} + \varepsilon_t$, $x_{it} = \rho x_{i,t-1} + \eta_{it}$)

           Confidence levels
           .90     .95     .99     $E(T\hat V(\hat\beta_{2T}))$   $MSE(T\hat V(\hat\beta_{2T}))$

White      .421    .494    .610    .948                           43.328
NW         .554    .636    .751    1.895                          29.888
NW-A       .570    .649    .765    2.870                          29.184
PW-Q-A     .765    .827    .899    *                              *


Note: * The moments of the quadratic spectrum covariance estimator do
not exist in this case because the prewhitening filter was estimated from an
unrestricted VAR(1) and the population AR(1) parameter is close to one.


Table 3.5: Monte Carlo Results With T = 64
Heteroskedastic Errors, $u_t = v_t|x_{2t}|$, $\rho = .9$; $T\,V(\hat\beta_{2T}) = 15.810$
($y_t = \beta_0 + \sum_{i=1}^{3} \beta_i x_{it} + u_t$, $v_t = \rho v_{t-1} + \varepsilon_t$, $x_{it} = \rho x_{i,t-1} + \eta_{it}$)

           Confidence levels
           .90     .95     .99     $E(T\hat V(\hat\beta_{2T}))$   $MSE(T\hat V(\hat\beta_{2T}))$

White      .375    .438    .557    1.565                          204.59
NW         .490    .568    .699    3.080                          170.10
NW-A       .509    .586    .714    3.525                          164.58
PW-Q-A     .648    .721    .824    *                              *


Note: As above.


Table 3.6: Monte Carlo Results With T = 64
Heteroskedastic Errors, $u_t = v_t|x_{2t}|$, $\rho = 0$; $T\,V(\hat\beta_{2T}) = 2.9176$
($y_t = \beta_0 + \sum_{i=1}^{3} \beta_i x_{it} + u_t$, $v_t = \rho v_{t-1} + \varepsilon_t$, $x_{it} = \rho x_{i,t-1} + \eta_{it}$)

           Confidence levels
           .90     .95     .99     $E(T\hat V(\hat\beta_{2T}))$   $MSE(T\hat V(\hat\beta_{2T}))$

White      .841    .903    .967    2.442                          1.736
NW         .826    .889    .961    2.339                          2.334
NW-A       .835    .898    .965    2.409                          2.157
PW-Q-A     .832    .894    .963    2.433                          2.389


Table 3.7: Monte Carlo Results With T = 64
Heteroskedastic Errors, $u_t = v_t|x_{2t}|$, $\rho = .5$; $T\,V(\hat\beta_{2T}) = 4.535$
($y_t = \beta_0 + \sum_{i=1}^{3} \beta_i x_{it} + u_t$, $v_t = \rho v_{t-1} + \varepsilon_t$, $x_{it} = \rho x_{i,t-1} + \eta_{it}$)

           Confidence levels
           .90     .95     .99     $E(T\hat V(\hat\beta_{2T}))$   $MSE(T\hat V(\hat\beta_{2T}))$

White      .738    .816    .917    2.265                          6.044
NW         .765    .837    .927    2.752                          6.495
NW-A       .766    .842    .930    2.763                          6.390
PW-Q-A                     .941    3.474                          120.050


Table 3.8: Monte Carlo Results With T = 64
Homoskedastic Errors, $\rho = .9$; $T\,V(\hat\beta_{2T}) = 2.079$
($y_t = \beta_0 + \sum_{i=1}^{3} \beta_i x_{it} + u_t$, $u_t = u_{t-1} - .8u_{t-2} + \varepsilon_t$, $x_{it} = \rho x_{i,t-1} + \eta_{it}$)

           Confidence levels
           .90     .95     .99     $E(T\hat V(\hat\beta_{2T}))$   $MSE(T\hat V(\hat\beta_{2T}))$


3.5.2 Regression Model


Suppose we wish to estimate the p x 1 parameter vector $\beta$ in the
following model:

$$ y_t = x_t'\beta + u_t, $$

where $u_t$ is a mean zero stationary random variable. We assume
that $\{x_t\}$, $t = 1, 2, \ldots, T$, are fixed in repeated samples.
(Alternatively, we could condition on the observed values of $x_t$.)
The parameter vector, $\beta$, is identified by the following
$q (= p)$ moment conditions:

$$ E(x_t u_t) = E(f_t) = 0, \qquad t = 1, 2, \ldots, T. $$

The OLS estimator $\hat\beta_T$ satisfies the sample moment conditions
$\sum_{t=1}^{T} x_t \hat u_t = 0$ so that
$\hat\beta_T = (X'X)^{-1}X'y$, where X and y are the
observation matrices stacked in the usual way and rank(X) = p.
Asymptotic inference on $\beta$ is based on

$$ \sqrt{T}(\hat\beta_T - \beta) \xrightarrow{d} N(0, M_x^{-1} V M_x^{-1}), $$

where
$V = \lim_{T\to\infty} T^{-1}\sum_{t=1}^{T}\sum_{s=1}^{T} E(u_t u_s)\, x_t x_s' = \sum_{s=-\infty}^{\infty} E(f_t f_{t-s}')$
is $2\pi$ times the spectral density of the process $f_t = x_t u_t$,
evaluated at the zero frequency, and
$\mathrm{plim}(X'X/T) = X'X/T = M_x$.


We will consider small sample corrections for estimates of V un- 
der three different assumptions on the properties of the error term 
in the above regression model. The first case is a heteroskedastic 
model with uncorrelated errors, the second case allows autocorrela- 
tion but no heteroskedasticity, and the third case allows both auto- 
correlation and heteroskedasticity. Each case will imply a specific 
form for V which will guide the choice of consistent estimator. 


3.5.3 Heteroskedastic but Serially Uncorrelated Errors


Many times it is reasonable to assume that the error term in a
cross-sectional regression model is conditionally heteroskedastic, but
is uncorrelated across observations. That is, suppose that
$E(u_t u_s) = 0$ for $t \neq s$, and
$E(u_t u_s) = \sigma(x_t) = \sigma_t^2$ for $t = s$. In this case
$V = T^{-1}\sum_{t=1}^{T} x_t \sigma_t^2 x_t'$, which is typically
estimated by White’s [1980] heteroskedasticity consistent estimator,
$\hat V_T = T^{-1}\sum_{t=1}^{T} x_t \hat u_t^2 x_t' = HC_T$.
MacKinnon and White [1985] propose three asymptotically
equivalent estimators of V which adjust $HC_T$ for its use of
residuals instead of the true errors. The authors then compare the
behavior of test statistics using the adjusted covariance estimates to
that of the traditional OLS test statistic and the test statistic
using $HC_T$.


The first small sample adjustment MacKinnon and White consider
is the standard degrees of freedom correction, T/(T - p). They
define the corrected estimator as

$$ HC1_T = (T/(T - p))\, HC_T . $$

Under homoskedasticity ($\sigma_t^2 = \sigma^2$ for all t), when the
estimand is $\sigma^2$, the scale factor T/(T - p) corrects the
estimator $\hat\sigma_T^2 = T^{-1}\sum_{t=1}^{T} \hat u_t^2$ for bias.
When the estimand is $V = T^{-1}\sum_{t=1}^{T} x_t \sigma_t^2 x_t'$,
however, the scale factor T/(T - p) usually does not eliminate the
small sample bias of $HC_T$.


Substituting $\hat u_t = (u_t - x_t'(X'X)^{-1}X'u)$ for the OLS
residual we see that, under homoskedasticity,

$$ E(\hat u_t^2) = \sigma^2 (1 - x_t'(X'X)^{-1}x_t) = \sigma^2 (1 - p_{tt}), $$

where $p_{tt}$ is the t’th diagonal element of $X(X'X)^{-1}X'$. So the
corrected estimator $HC1_T$ is unbiased only if $p_{tt} = p/T$ for
every t.
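

The role of the $p_{tt}$ can be checked numerically. The sketch below
computes the diagonal of $X(X'X)^{-1}X'$ and the expectation of the
White estimator under iid errors,
$\sigma^2 T^{-1}\sum_t (1 - p_{tt}) x_t x_t'$. Because the $p_{tt}$ sum
to p, the factor T/(T - p) undoes the shrinkage exactly only when every
$p_{tt}$ equals p/T. The function names are hypothetical.

```python
import numpy as np


def leverages(X):
    """Diagonal elements p_tt of the projection matrix X (X'X)^{-1} X'."""
    XtX_inv = np.linalg.inv(X.T @ X)
    return np.einsum("ti,ij,tj->t", X, XtX_inv, X)


def expected_hc_under_homoskedasticity(X, sigma2=1.0):
    """E(HC_T) = sigma^2 T^{-1} sum_t (1 - p_tt) x_t x_t' for iid errors."""
    T = X.shape[0]
    w = sigma2 * (1.0 - leverages(X))
    return (X * w[:, None]).T @ X / T
```

For a regression on a constant alone every $p_{tt}$ equals 1/T, and
multiplying `expected_hc_under_homoskedasticity(X)` by T/(T - 1)
recovers $\sigma^2 X'X/T$ exactly; with unequal $p_{tt}$ it does not.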


This result suggests a second bias-adjustment to $HC_T$ which
Horn, Horn and Duncan [1975] propose and MacKinnon and White
define as,

$$ HC2_T = T^{-1}\sum_{t=1}^{T} x_t \tilde u_t^2 x_t', $$

where $\tilde u_t^2 = \hat u_t^2/(1 - p_{tt})$. This estimator is, of
course, unbiased under homoskedasticity.


MacKinnon and White note that each of the three estimators of
V, when pre- and post-multiplied by $(X'X/T)^{-1}$, is a type of
jackknife estimator of the covariance matrix of $\sqrt{T}\hat\beta_T$.
They therefore suggest as their final estimator that part of the
ordinary jackknife estimator (see Efron [1982]) which corresponds to
an estimate of V. They define this estimator as,

$$ HC3_T = \frac{T-1}{T}\Bigl[ T^{-1}\sum_{t=1}^{T} x_t u_t^{*2} x_t' - T^{-2} X'u^{*}u^{*\prime}X \Bigr], $$

where $u_t^{*} = \hat u_t/(1 - p_{tt})$.


MacKinnon and White perform Monte Carlo experiments to
compare rejection frequencies of quasi t-statistics calculated using
$HC1_T$, $HC2_T$, and $HC3_T$ in a three parameter regression model
under both homoskedasticity and two types of heteroskedasticity.
The estimated rejection frequencies show that $HC3_T$ dominates the
other two heteroskedasticity consistent estimators of V. The Monte
Carlo results also indicate that the standard OLS test statistic
substantially over rejects the null hypothesis when the error is
heteroskedastic; in these cases, the test statistics using $HC3_T$
also over reject, but by a smaller amount. In the case of
homoskedastic errors, the standard OLS test statistic performs the
best, although the actual size of test statistics calculated using
$HC3_T$ is not far from its asymptotic size.
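

The four estimators of V discussed in this section can be computed
side by side. The sketch below follows the definitions given above,
with the jackknife-based estimator written as $(T-1)/T$ times the
difference between the weighted outer-product term and
$T^{-2}X'u^{*}u^{*\prime}X$. The function names and the dictionary
keys are hypothetical.

```python
import numpy as np


def hc_estimators(X, y):
    """OLS coefficients and the HC, HC1, HC2, HC3 estimates of V."""
    T, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ (X.T @ y)
    u = y - X @ beta
    ptt = np.einsum("ti,ij,tj->t", X, XtX_inv, X)

    def weighted(w):
        # T^{-1} sum_t x_t w_t x_t'
        return (X * w[:, None]).T @ X / T

    hc = weighted(u**2)
    u_star = u / (1.0 - ptt)
    Xu = X.T @ u_star
    estimates = {
        "HC": hc,
        "HC1": T / (T - p) * hc,
        "HC2": weighted(u**2 / (1.0 - ptt)),
        "HC3": (T - 1) / T * (weighted(u_star**2) - np.outer(Xu, Xu) / T**2),
    }
    return beta, estimates


def coefficient_covariance(X, V):
    """Covariance of the OLS estimator implied by an estimate of V."""
    T = X.shape[0]
    M_inv = np.linalg.inv(X.T @ X / T)
    return M_inv @ V @ M_inv / T
```

Quasi t-statistics are then formed by dividing each coefficient by the
square root of the corresponding diagonal element of
`coefficient_covariance(X, V)`.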


3.5.4 Autocorrelated but Homoskedastic Errors 


It is often reasonable to assume when estimating a regression
model with time series data that the error term is autocorrelated
but conditionally homoskedastic. In our notation, this assumption
can be stated as $E(u_t u_s) = C_{t-s}$ for $t \neq s$, and
$E(u_t u_s) = \sigma^2$ for $t = s$. In this case the estimand of
interest becomes

$$ V = 2\pi \int_{-\pi}^{\pi} \lim_{T\to\infty} T^{-1} \Bigl|\sum_{t=1}^{T} x_t e^{i\lambda t}\Bigr|^2 S_u(\lambda)\, d\lambda, $$

a weighted integral of the spectral density of $u_t$, where
$S_u(\lambda) = (2\pi)^{-1}\sum_{s=-\infty}^{\infty} C_s e^{-i\lambda s}$.
With the explanatory variables fixed in repeated samples (or,
conditioning on the sample observations), the weight in the integral
becomes $T^{-1}\bigl|\sum_{t=1}^{T} x_t e^{i\lambda t}\bigr|^2$, a
known, finite p x p nonsingular matrix.


The assumption of homoskedasticity greatly simplifies the
estimation of V. Because the estimand is a weighted integral of the
spectral density of $u_t$ and the weights are finite, the weighted sum
of the periodogram ordinates of the OLS residuals provides a
consistent estimator of V. One need not use a kernel spectral density
estimator of $S_u(\lambda)$ to form a consistent estimator of V.
Cushing and McGarvey [1997b] examine the finite sample properties of
test statistics based on this type of autocorrelation consistent,
weighted periodogram estimator of V. We consider both an unadjusted
and a bias-adjusted weighted periodogram estimator.


The unadjusted autocorrelation consistent estimator of V is

$$ AC1_T = 2\pi \int_{-\pi}^{\pi} T^{-1} \Bigl|\sum_{t=1}^{T} x_t e^{i\lambda t}\Bigr|^2 \hat I_{\hat u,T}(\lambda)\, d\lambda, $$

where
$\hat I_{\hat u,T}(\lambda) = (2\pi T)^{-1}\bigl|\sum_{t=1}^{T} \hat u_t e^{i\lambda t}\bigr|^2$.
To find an approximate bias adjustment factor for this estimator we
consider the case in which the OLS estimator $\hat\beta_T$ is an
asymptotically efficient estimator of $\beta$. Anderson [1971] gives
the asymptotic bias of the periodogram ordinates based on OLS
residuals in this special case:

$$ \lim_{T\to\infty} E(\hat I_{\hat u,T}(\lambda)) - S_u(\lambda) = -S_u(\lambda)\, \tilde x_T(\lambda)'(X'X)^{-1}\tilde x_T(\lambda), $$

where $\tilde x_T(\lambda) = \sum_{t=1}^{T} x_t e^{i\lambda t}$ and
transposition also denotes complex conjugation.


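The above result suggests a simple bias adjustment to the
estimated periodogram. The adjusted autocorrelation consistent
estimator is

$$ AC2_T = 2\pi \int_{-\pi}^{\pi} T^{-1} \Bigl|\sum_{t=1}^{T} x_t e^{i\lambda t}\Bigr|^2 \frac{\hat I_{\hat u,T}(\lambda)}{1 - \tilde x_T(\lambda)'(X'X)^{-1}\tilde x_T(\lambda)}\, d\lambda . $$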


Cushing and McGarvey perform a small scale Monte Carlo study
to compare the rejection frequencies of OLS test statistics based
on covariance estimates using $AC1_T$ and $AC2_T$ to test statistics
based on the Newey-West estimator and to the standard OLS
test statistic. Our preliminary results suggest that when $u_t$ is
homoskedastic and $u_t$ and $x_t$ are positively autocorrelated, one
is better off using either of the autocorrelation consistent
estimators $AC1_T$ and $AC2_T$ than the HAC Newey-West estimator.
Using the bias adjusted estimator, $AC2_T$, to construct test
statistics appears to substantially reduce the discrepancy between the
test’s actual and nominal size. For example, in a model with p = 5,
T = 128, and both $x_t$ and $u_t$ with first-order autocorrelation
coefficient of .9, testing at a nominal level of .05 results in an
actual size of .49 using the standard OLS test statistic. Using the
Newey-West estimator to calculate the test statistic reduces the
rejection frequency to .29 and using the correct parametric estimator
reduces the size further to .14. The test statistic based on our
adjusted autocorrelation consistent estimator, however, rejects the
true null 12% of the time, even closer to the nominal 5% level than
that found using the parametric estimator.


More extensive Monte Carlo analysis is needed before we can
reach general conclusions about the benefits of using autocorrelation
consistent estimators such as $AC1_T$ and $AC2_T$. In particular,
a comparison of these test statistics’ finite sample performance
needs to be made under heteroskedasticity. Our preliminary results
suggest, however, that if the model is homoskedastic, a covariance
estimator which incorporates that assumption will outperform a
more robust HAC estimator in finite samples.


3.5.5 Autocorrelated and Heteroskedastic Errors 


The final case we consider is the general stationary error, which
is conditionally heteroskedastic,
$E(u_t u_s) = \sigma(x_t) = \sigma_t^2$ for $t = s$, and
autocorrelated, $E(u_t u_s) = C_{t-s}$ for $t \neq s$. This is the
case for which kernel spectral density estimators of V are ideally
suited. The estimand is unrestricted; it is merely $2\pi$ times the
spectral density matrix of $x_t u_t$ evaluated at the zero frequency.


It is well known that the kernel spectral density estimator,
$\hat V_T$, is a biased estimator of the spectral density of
$x_t u_t$ evaluated at frequency zero and that test statistics based
on this estimator suffer from size distortions. As discussed earlier,
Andrews [1991] and Andrews and Monahan [1992] suggest using an
estimate of the optimal lag truncation parameter and prewhitening the
residuals both to reduce the mean-squared-error of the covariance
estimates and to improve the performance of confidence intervals.
Neither of these suggestions, however, explicitly takes into account
the estimator’s use of residuals instead of true errors. Cushing and
McGarvey [1997a] attempt to correct $\hat V_T$ for bias due to its use
of residuals and suggest a degrees of freedom adjustment to the
resulting test statistic.


To see the problem introduced by using residuals, recall that
the HAC estimator of V is a weighted integral of the estimated
periodogram of $x_t \hat u_t$,

$$ \hat V_T = \int_{-\pi}^{\pi} W(\lambda, m_T)\, \hat I_T(\lambda)\, d\lambda, $$

where
$\hat I_T(\lambda) = (2\pi T)^{-1}\sum_{t=1}^{T}\sum_{s=1}^{T} x_t \hat u_t \hat u_s x_s' e^{i\lambda(t-s)}$.
Because the OLS estimator satisfies the sample moment conditions,
$\sum_{t=1}^{T} x_t \hat u_t = 0$, the estimated periodogram is
constrained to be identically zero at the zero frequency. Thus, even
asymptotically, there is a bias in the periodogram estimates from
using residuals.
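

The constraint at the zero frequency is easy to verify numerically.
The sketch below evaluates the periodogram matrix of the products
$x_t u_t$ at a given frequency; the $(2\pi T)^{-1}$ normalization and
the function name are assumptions, and the normalization plays no role
in the result at frequency zero.

```python
import numpy as np


def periodogram_of_moments(X, u, lam):
    """(2 pi T)^{-1} d d^H, where d = sum_t x_t u_t exp(i * lam * t)."""
    T = X.shape[0]
    t = np.arange(1, T + 1)
    d = (X * (u * np.exp(1j * lam * t))[:, None]).sum(axis=0)
    return np.outer(d, d.conj()) / (2.0 * np.pi * T)
```

Evaluated at `lam = 0.0` with OLS residuals in place of `u`, the
function returns a matrix that is zero up to rounding error, whereas
with the true errors it generally does not.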


To calculate the bias in $\hat V_T$ due to using residuals instead of
the true errors, we assume the regression error is conditionally
homoskedastic and uncorrelated across observations. In this case,

$$ E(\hat I_T(\lambda)) = E(I_T(\lambda)) - D_T(\lambda, E(I_T(0))), $$

where $I_T(\lambda)$ is the periodogram of $\{x_t u_t\}$ and the
difference $D_T(\lambda, E(I_T(0)))$ is

$$ D_T(\lambda, I_T(0)) = \sum_{t=1}^{T}\sum_{s=1}^{T} x_t x_t'(X'X)^{-1}\, I_T(0)\, (X'X)^{-1} x_s x_s'\, e^{i\lambda(t-s)} . $$

Note that there is no asymptotic bias in the periodogram estimates
from using residuals except, of course, at the zero frequency, for
$\lim_{T\to\infty} E(\hat I_T(\lambda) - I_T(\lambda)) = -S_{xu}(\lambda)$
for $\lambda = 0$ and $= 0$ for $\lambda \neq 0$.


Since, as $T \to \infty$, the point estimator $\hat V_T$ is a
weighted average of $\hat I_T(\lambda)$ over the entire interval
$-\pi$ to $\pi$, using residuals to estimate $S_{xu}(0)$ is
asymptotically equivalent to using the true errors. In sample sizes
typical of economic quarterly data, however,
$E(\hat I_T(\lambda) - I_T(\lambda))$ will be large at frequencies
close to zero if $x_t$ is positively autocorrelated and, because
$\hat V_T$ is a weighted average of a finite number of frequencies, in
practice this difference can create a large bias in $\hat V_T$.

Our adjusted HAC estimator, based on the result above, reduces
the bias in $\hat V_T$ due to using residuals and eliminates this bias
when the regression error is white noise. The adjusted estimator is

$$ \mathrm{vec}\,\hat V_T^A = (I_{p^2} - A)^{-1}\, \mathrm{vec}\,\hat V_T, $$

where

$$ A = \int_{-\pi}^{\pi} W(\lambda, m_T) \sum_{t=1}^{T}\sum_{s=1}^{T} e^{i\lambda(t-s)}\, \bigl(x_t x_t'(X'X)^{-1}\bigr) \otimes \bigl((X'X)^{-1} x_s x_s'\bigr)\, d\lambda . $$


This estimator exists if the eigenvalues of A are less than one in
absolute value, a condition which can be checked easily. If this
condition is true, then

$$ \mathrm{vec}\,\hat V_T^A = \sum_{j=0}^{\infty} A^j\, \mathrm{vec}\,\hat V_T . $$

If we restrict $\hat V_T$ to be a positive semidefinite HAC kernel
estimator, then each term in the above summation is the vectorization
of a positive semidefinite matrix. Thus, the bias-adjusted estimator,
$\hat V_T^A$, is positive semidefinite.
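

Given a matrix A and an estimate of V, the adjustment and its series
representation can be sketched as below. The construction of A itself
is not attempted here: A is taken as an input, assumed real and of
dimension $p^2 \times p^2$, and the function names are hypothetical.

```python
import numpy as np


def bias_adjust(V_hat, A):
    """Solve (I - A) vec(V_A) = vec(V_hat) after checking the eigenvalues of A."""
    V_hat = np.asarray(V_hat, dtype=float)
    A = np.asarray(A, dtype=float)
    p = V_hat.shape[0]
    if np.max(np.abs(np.linalg.eigvals(A))) >= 1.0:
        raise ValueError("an eigenvalue of A is not less than one in absolute value")
    vec_V = V_hat.reshape(-1, order="F")  # column-stacking vec operator
    vec_VA = np.linalg.solve(np.eye(p * p) - A, vec_V)
    return vec_VA.reshape(p, p, order="F")


def neumann_partial_sum(V_hat, A, terms):
    """Partial sum over j < terms of A^j vec(V_hat), reshaped to p x p."""
    V_hat = np.asarray(V_hat, dtype=float)
    A = np.asarray(A, dtype=float)
    p = V_hat.shape[0]
    term = V_hat.reshape(-1, order="F").copy()
    total = np.zeros_like(term)
    for _ in range(terms):
        total += term
        term = A @ term
    return total.reshape(p, p, order="F")
```

When the eigenvalue condition holds, the partial sums converge to the
solution returned by `bias_adjust`, which is the equivalence stated
above.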


Cushing and McGarvey use a small scale Monte Carlo study to
compare the effects of correcting the Bartlett-Newey-West estimator
by the standard bias correction T/(T - p) to correcting by the
method given above. The experiments paralleled those performed
by Andrews [1991] with a five parameter, orthogonal regressor
model with $x_t$ and $u_t$ having identical first-order
autocorrelation coefficient, $\rho$, equal to 0, .5, .7, .9 and .95,
and T = 128. We find that using $\hat V_T^A$ rather than
$(T/(T - p)) \cdot \hat V_T$ to estimate the variance of an element of
$\hat\beta$ results in a smaller bias and MSE across all window widths
considered. When comparing the two estimators’ MSEs across only
optimal window widths, our bias adjusted version of the Newey-West
estimator provides between a 16% and 27% lower MSE than the
uncorrected version.


More relevant than the bias and MSE of the estimated variance
of $\hat\beta_i$ is the behavior of the quasi t-statistic and
estimated confidence intervals for $\beta_i$. The Monte Carlo results
of Andrews and others suggest that, even if one adjusts $\hat V_T$ by
T/(T - p) to calculate standard errors for $\hat\beta_i$, with highly
autocorrelated $x_t$ and $u_t$, the actual confidence levels are much
smaller than their nominal levels. One reason that the standard
degrees of freedom correction, T/(T - p), does not eliminate the bias
in $\hat V_T$ due to using residuals is that, when estimating the
spectral density at zero, it is not p, the number of estimated
parameters, which is relevant, but rather $m_T$, the number of
covariances included in the weighted estimator. Therefore, we suggest
not only using $\hat V_T^A$ to calculate the standard error of
$\hat\beta_i$ but also basing confidence intervals and hypothesis
tests on the t distribution with degrees of freedom adjusted for using
residuals.


To calculate the adjusted degrees of freedom, we first note that
the ith diagonal element of the Bartlett estimator of $S_{xu}(0)$
using the true errors is asymptotically Chi-squared. We approximate
its degrees of freedom by $1.5(T/m_T)$, two times the ratio of the
square of the estimator’s asymptotic mean to its asymptotic variance,
where the variance is calculated under the assumption that $x_t u_t$
is white noise. We then scale $1.5(T/m_T)$ by the inverse of CR, our
bias adjustment factor, where
$CR = \mathrm{trace}(\hat V_T^A)/\mathrm{trace}(\hat V_T)$. Our
degrees of freedom adjusted confidence interval thus uses
$\hat V_T^A$ to calculate the standard error of $\hat\beta_i$ and the
t value corresponding to degrees of freedom
$CR^{-1} \cdot 1.5(T/m_T)$.
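

The arithmetic of the degrees of freedom adjustment is short enough
to write out. The sketch below assumes, as the text states, that the
baseline $1.5(T/m_T)$ is scaled by the inverse of the trace ratio CR;
the function name is hypothetical.

```python
import numpy as np


def adjusted_degrees_of_freedom(T, m_T, V_adj, V_hat):
    """1.5 * (T / m_T) scaled by the inverse of CR = trace(V_adj) / trace(V_hat)."""
    cr = np.trace(V_adj) / np.trace(V_hat)
    return 1.5 * (T / m_T) / cr
```

The resulting value is generally not an integer. A critical value can
then be read from a t distribution that accepts fractional degrees of
freedom, for example through `scipy.stats.t.ppf`.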


In our Monte Carlo experiments we calculated confidence intervals
based on $\hat V_T^A$ and the degrees of freedom corrected t value as
well as standard confidence intervals using
$\hat V_T \cdot T/(T - p)$ and the z value. Our adjusted confidence
intervals work remarkably well compared to the standard procedure.
The improvement is most dramatic for moderately to highly
autocorrelated $x_t u_t$, just as is the improvement in bias and MSE.
For example, with $\rho = .95$, the true confidence level
corresponding to a nominal 95% interval using the standard procedure
was estimated to be 62%. Using our procedure it was estimated to be
94%. With $\rho = 0$, however, both methods produced correct
confidence levels.


3.5.6 Summary 


Monte Carlo evidence indicates that the standard HAC kernel spectral
density estimators of $V_T$ are severely downwardly biased in the
presence of positive autocorrelation. Moreover, quasi t-statistics
appear to reject a true null far more often than their nominal
size. The results reviewed in Section 3.5 suggest that one can
reduce this bias and improve the finite sample performance of
test statistics in linear regression models by adjusting the estimate
of $V_T$ for small sample bias. These adjustments alone appear to
work very well if one can reasonably assume a model with either
heteroskedasticity with no autocorrelation or autocorrelation with
no heteroskedasticity. The first assumption is common in cross
section regression models and the second is often made in time
series regression models.


In the more general case of both conditionally heteroskedastic
and autocorrelated errors, it appears that a bias correction to
$\hat V_T$ alone will not sufficiently improve the behavior of
estimated test statistics or confidence intervals on $\beta$. Two
approaches suggested in the literature appear promising in this case.
Andrews and Monahan [1992] examine the behavior of estimated
confidence intervals in regression models using $\hat V_T$ where the
optimal value of $m_T$ is estimated after prewhitening the $x_t u_t$
process. Their Monte Carlo evidence suggests this method can
substantially improve the behavior of the resulting test statistics.
The second approach is that suggested by Cushing and McGarvey [1997].
Our preliminary results suggest that using a bias adjusted version of
$\hat V_T$ in conjunction with a degrees of freedom corrected t
critical value has the potential to dramatically improve finite sample
inference.


Chapter 4


HYPOTHESIS TESTING IN MODELS ESTIMATED BY GMM


Alastair R. Hall 


Since its introduction the Generalized Method of Moments (GMM)
has had considerable impact on the theory and practice of
econometrics. For theoreticians, the main advantage is that GMM
provides a very general framework for considering issues of
statistical inference because it encompasses many estimators of
interest in econometrics. For applied researchers, it provides a
computationally convenient method of estimating nonlinear dynamic
models without complete knowledge of the probability distribution of
the data.¹ All these applications have a common structure. An
economic and/or statistical model implies that a vector of observed
variables, $x_t$, and an unknown parameter vector, $\theta_0$, satisfy
a vector of population moment conditions,

$$ E[f(x_t, \theta_0)] = 0. $$ (4-1)

For the model to be of practical use, it is necessary to find a suitable
value for the parameter vector. GMM provides a simple framework
by which the information in (4-1) can be combined with a sample
on $x_t$ to provide an estimate of $\theta_0$. Within this general
structure, three broad questions naturally arise: Is the estimation
based upon correct information? Does the parameter vector satisfy a
set of restrictions implied by economic theory? Which of two competing
models is correct? This chapter reviews various hypothesis testing
techniques which have been designed to answer these questions.


1 See Hall [1993] and Ogaki [1993] for an overview of these applications.


The population moment condition must possess certain properties
for GMM estimation to be feasible. At this stage, we need
only note the requirement that q, the dimension of the population
moment condition, is at least as large as p, the dimension of the
parameter vector. In nearly all the applications referred to above,
q exceeds p and we shall maintain this assumption for most of
our discussion because this introduces the unique features of the
GMM framework. By its very nature, GMM induces a fundamental
relationship between the parameter estimates and the population
moment conditions upon which they are based. This relationship
limits the types of hypotheses which can be tested, and also implies
certain important properties which play a crucial role in the analysis
of the various statistics discussed below. Therefore in Section 4.1,
we briefly describe the basic structure of GMM estimation. Particular
attention is paid to the decomposition of the population moment
condition into the identifying restrictions and the overidentifying
restrictions. The identifying restrictions represent the part of the
population moment condition which actually goes into parameter
estimation; the overidentifying restrictions are just the remainder.
In the sample, the analog to the identifying restrictions is satisfied
at the estimator, $\hat\theta_T$, and so it is not possible to test
whether the identifying restrictions hold at the true parameter value.
However, the sample analog to the overidentifying restrictions is not
imposed and so it is possible to test whether these restrictions hold
in the population. Most importantly, the two components of this
decomposition are orthogonal and this property plays an important
role in the discussion.


In practice, it is prudent to begin by testing the overidentifying 
restrictions because if this hypothesis is rejected then this indicates 
the model is misspecified. In such an eventuality, the natural course 
is to re-examine the specification rather than proceed to examine 
other types of hypotheses within the current model. Therefore 
our review begins in Section 4.2 with a description of a statistic 
for testing the overidentifying restrictions and a discussion of its 
properties. If the null is rejected, then it is useful to diagnose the 
source of the problem and Section 4.3 describes statistics for testing 
hypotheses about a subset of the moment conditions. Section 4.4 
discusses methods for testing the hypothesis that the parameter
vector satisfies a set of nonlinear restrictions of the form
$r(\theta_0) = 0$. These types of restrictions naturally arise in many
economic models and so test results can often provide useful insights
about the underlying economic structure. It is shown that these
hypotheses can be expressed in terms of the identifying restrictions
and this perspective provides a useful interpretation of this type of
test.


One of the main assumptions behind GMM is that the pop- 
ulation moment condition in (4-1) holds throughout the entire 
sample; in other words the model is assumed to be structurally 
stable. A natural concern is that (4-1) may only hold for part of 
the sample and so the model exhibits structural instability. Section 
4.5 describes various methods for testing structural stability. The 
differences between the tests are most easily understood by consid- 
ering their sensitivity to instability of identifying and overidentify- 
ing restrictions separately. It is also shown how this decomposition 
can be exploited to develop tests which can distinguish between 
instability in the parameters alone and instability of a more general 
form. 


In most cases, a number of models have been proposed to 
explain a particular economic phenomenon. In some cases, one 
model is nested within another and so it is possible to assess which is 
more appropriate using the types of procedure described in Sections 
4.2 through 4.4. However, in other cases the competing models are 
not nested in this fashion, and so alternative procedures must be 
developed. As will be seen, this type of question is much harder to 
address within the GMM framework without further restrictions. 
Section 4.6 discusses the difficulties and describes various
procedures which have been proposed for testing non-nested hypotheses.


Inevitably, this type of review must be selective. We have 
chosen to focus on inference procedures which highlight the unique 
aspects of the GMM framework. This choice is guided by the types 
of problems that arise in applications in economics and finance. 
Obvious omissions are hypothesis testing procedures originally
proposed for specific examples of GMM such as maximum likelihood
or least squares.² In our opinion, this omission is justified because
little insight is typically gained by reinterpreting these tests from 
a GMM perspective. However, there are some exceptions, such as 
conditional moment and Hausman tests, and so these are briefly 
discussed in Section 4.7. It is also beyond the scope of this chapter
to provide an introduction to the general theory of statistical
hypothesis testing; the interested reader is referred to Lehmann [1959]
or Cox and Hinkley [1974]. Finally, it should be noted that we
concentrate purely on the asymptotic properties of the tests and
leave a review of their finite sample behavior to Chapter 5.


2 Engle [1984] provides a comprehensive review of these techniques.


We conclude this introduction by stating certain conventions 
which are maintained throughout to enable us to focus more clearly 
on the issues associated with hypothesis testing. The data
$\{x_t;\ t = 1,2,\ldots,T\}$ are assumed to be a realization from a stationary and
ergodic process. In most cases, it is desired to perform inference
based on the GMM estimator calculated with the optimal weighting
matrix and so we restrict attention to this case. Accordingly, $\hat\theta_T$ is
defined to be the value of $\theta$ which minimizes


$$Q_T(\theta) = f_T(\theta)'\,\hat V_T^{-1} f_T(\theta),$$


where $f_T(\theta) = T^{-1}\sum_{t=1}^{T} f(x_t,\theta)$ and $\hat V_T$ is a consistent estimator
of $V = \lim_{T\to\infty} \mathrm{Var}[T^{1/2} f_T(\theta_0)]$; see Chapter 3. All the procedures
are based on asymptotic theory and so it is assumed that the
necessary regularity conditions apply; see Chapter 1.³ The symbol
$\overset{A}{\sim}$ denotes “asymptotically distributed as”; $N(\mu, \Sigma)$ denotes a
multivariate normal distribution with mean $\mu$ and covariance matrix
$\Sigma$; $\chi^2_a(b)$ denotes a non-central $\chi^2$ distribution with $a$ degrees of
freedom and non-centrality parameter $b$ but the latter is omitted
if $b = 0$.


4.1 Identifying and Overidentifying 
Restrictions 


We begin with a description of the decomposition of the population 
moment condition into identifying and overidentifying restrictions. 
This framework was first explicitly characterized by Sowell [1996a], 
although its existence was clearly exploited implicitly in earlier 
work such as Newey [1985a]. 


The decomposition stems from the set of first order conditions
associated with the minimization of $Q_T(\theta)$,


$$\partial Q_T(\hat\theta_T)/\partial\theta = F_T(\hat\theta_T)'\,\hat V_T^{-1} f_T(\hat\theta_T) = 0.$$ (4-2)


3 Also see Hansen [1982] and Gallant and White [1988].


Sowell [1996a] observes that since $\hat\theta_T$ satisfies (4-2) it must also
satisfy

$$\hat P\,\hat V_T^{-1/2} f_T(\hat\theta_T) = 0,$$ (4-3)


where $\hat P = \hat M(\hat M'\hat M)^{-1}\hat M'$, $\hat M = \hat V_T^{-1/2} F_T(\hat\theta_T)$ and
$F_T(\theta) = \partial f_T(\theta)/\partial\theta'$. Therefore, although we began with the information in
(4-1), GMM estimation is actually based on the population analog
to (4-3), namely


$$P\,V^{-1/2} E[f(x_t,\theta_0)] = 0,$$ (4-4)


where $P = M(M'M)^{-1}M'$, $M = V^{-1/2}E[F_t(\theta_0)]$ and
$F_t(\theta) = \partial f(x_t,\theta)/\partial\theta'$. Sowell [1996a] termed the elements of (4-4) the
identifying restrictions for $\theta_0$. If $\theta_0$ is identified, then the projection
matrix $P$ is of rank $p$ and so these restrictions set only $p$ unique
linear combinations of the $(q \times 1)$ vector $E[f(x_t,\theta_0)]$ to zero. Since
$q > p$, this does not necessarily imply (4-1), and so clearly a
part of the original moment condition is unused in estimation. By
definition this remainder is


$$(I_q - P)\,V^{-1/2} E[f(x_t,\theta_0)] = 0$$ (4-5)


and this equation contains what Hansen termed the overidentifying
restrictions in his original article.
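

A small numerical sketch may help fix ideas about the split into (4-4) and (4-5). It uses Python with NumPy; the matrix `M` and the vector `v` are randomly generated stand-ins (assumptions made purely for illustration, not quantities from any model) for $V^{-1/2}E[F_t(\theta_0)]$ and $V^{-1/2}E[f(x_t,\theta_0)]$, and `projection_split` is a hypothetical helper name.

```python
import numpy as np

def projection_split(M):
    """Return (P, I - P), where P projects onto the column space of M."""
    q = M.shape[0]
    P = M @ np.linalg.solve(M.T @ M, M.T)
    return P, np.eye(q) - P

rng = np.random.default_rng(0)
q, p = 4, 2                        # illustrative dimensions with q > p
M = rng.standard_normal((q, p))    # stand-in for V^{-1/2} E[F_t(theta_0)]
v = rng.standard_normal(q)         # stand-in for V^{-1/2} E[f(x_t, theta_0)]

P, R = projection_split(M)
identifying = P @ v                # the part set to zero in estimation
overidentifying = R @ v            # the part left over for testing
```

The projection has trace (and rank) $p$, the two components are orthogonal, and they add back to `v`, which is the sense in which the moment condition is split into $p$ identifying and $q - p$ overidentifying restrictions.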


This decomposition has fundamental implications for two
statistics which play a central role in hypothesis testing. By definition,
the identifying restrictions are imposed in the sample to obtain
$\hat\theta_T$. It is therefore not surprising that the function of the data
in these restrictions determines the asymptotic distribution of the
estimator. It can be shown that


$$T^{1/2}(\hat\theta_T - \theta_0) = -(M'M)^{-1}M'V^{-1/2}T^{1/2} f_T(\theta_0) + o_p(1)$$ (4-6)

and so $M T^{1/2}(\hat\theta_T - \theta_0)$ is asymptotically equivalent to
$-P V^{-1/2} T^{1/2} f_T(\theta_0)$. Notice also that (4-6) implies

$$T^{1/2}(\hat\theta_T - \theta_0) \overset{d}{\to} N\left(0, (M'M)^{-1}\right).$$


The other statistic of interest is the sample moment condition,
$f_T(\hat\theta_T)$. These sample moments are in fact the sample analog to
the function of the data in the overidentifying restrictions because
using (4-3) we have


$$(I_q - \hat P)\,\hat V_T^{-1/2} f_T(\hat\theta_T) = \hat V_T^{-1/2} f_T(\hat\theta_T).$$ (4-7)

This means that $Q_T(\hat\theta_T)$ measures how far the data are from
satisfying the overidentifying restrictions in the sample. Given this
structure it is not surprising to discover that the asymptotic
distribution of these sample moments is determined by the function
of the data in the overidentifying restrictions. Since $\hat P$ converges in
probability to $P$, it follows from (4-7) that


$$\hat V_T^{-1/2} T^{1/2} f_T(\hat\theta_T) = (I_q - P)\,V^{-1/2} T^{1/2} f_T(\theta_0) + o_p(1)$$ (4-8)


and so

$$\hat V_T^{-1/2} T^{1/2} f_T(\hat\theta_T) \overset{d}{\to} N(0, I_q - P).$$ (4-9)


One final aspect needs to be noted: the decomposition is
orthogonal because of the projection matrix structure. This implies
the asymptotic covariance of the two statistics is zero because from
(4-6) and (4-8)

$$\lim_{T\to\infty} \mathrm{Cov}\left[T^{1/2}(\hat\theta_T - \theta_0),\ \hat V_T^{-1/2} T^{1/2} f_T(\hat\theta_T)\right] = -(M'M)^{-1}M'(I_q - P) = 0$$ (4-10)

and hence the two statistics are asymptotically independent. This
property plays an important role in our discussion.


4.2 Testing Hypotheses About $E[f(x_t,\theta_0)]$


The population moment condition in (4-1) is derived from the set 
of assumptions which make up the underlying economic and/or
statistical model. By their nature, these assumptions may or may
not be correct, and so it is important to assess whether (4-1) is
consistent with the data. For reasons that will become apparent,
it is most convenient to state this hypothesis formally as⁴


$$H_0:\ V^{-1/2} E[f(x_t,\theta_0)] = 0.$$ (4-11)

Using the decomposition described above it is clear that $H_0 = H_0^I \cap H_0^O$,
where

$$H_0^I:\ P\,V^{-1/2} E[f(x_t,\theta_0)] = 0$$
$$H_0^O:\ (I_q - P)\,V^{-1/2} E[f(x_t,\theta_0)] = 0.$$


4 Since $V$ is positive definite by assumption, $V^{-1/2}$ is nonsingular and hence
(4-1) and (4-11) are equivalent.


Our earlier discussion implies that it is not possible to test $H_0^I$
because the sample analogs of the identifying restrictions are
automatically satisfied by the estimated sample moments. However,
$H_0^O$ can be tested because the overidentifying restrictions are not
imposed during estimation. Hansen [1982] proposes testing this
null hypothesis using the statistic


$$J_T = T\,Q_T(\hat\theta_T)$$


and shows that it converges to a $\chi^2_{q-p}$ distribution under $H_0^O$. This
result can be derived heuristically by using (4-1) and (4-11) to
deduce that


$$J_T \overset{A}{\sim} z_q'(I_q - P)'(I_q - P)z_q = z_q'(I_q - P)z_q,$$ (4-12)


where $z_q \sim N(0, I_q)$. The quoted distribution then follows directly
from (4-12); for example, see Judge et al. [1985, p. 943].
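

To make the construction of $J_T$ concrete, the following is a minimal sketch for one particular case: a linear model with moments $f(x_t,\theta) = z_t(y_t - x_t'\theta)$, estimated by two-step GMM. The function name `j_statistic`, the simulated data and the variance estimator (which assumes serially uncorrelated moments) are all assumptions made for this illustration; they are not taken from the chapter.

```python
import numpy as np

def j_statistic(y, X, Z):
    """Two-step GMM for moments f(x_t, theta) = z_t (y_t - x_t' theta).

    Returns (theta_hat, J, dof). The estimate of V assumes the moments
    are serially uncorrelated; otherwise a HAC estimator is needed.
    """
    T, q = Z.shape
    p = X.shape[1]
    A = X.T @ Z / T                      # p x q, equals -F_T'
    b = Z.T @ y / T                      # q-vector, so f_T(theta) = b - A' theta

    def solve(W):                        # minimiser of f_T(theta)' W f_T(theta)
        return np.linalg.solve(A @ W @ A.T, A @ W @ b)

    theta1 = solve(np.linalg.inv(Z.T @ Z / T))   # first-step estimator
    g = Z * (y - X @ theta1)[:, None]            # T x q moment contributions
    Vinv = np.linalg.inv(g.T @ g / T)            # inverse of the estimate of V
    theta = solve(Vinv)                          # second-step estimator
    f = b - A.T @ theta                          # f_T(theta_hat)
    J = T * f @ Vinv @ f                         # J_T = T Q_T(theta_hat)
    return theta, J, q - p

# Simulated example: one endogenous regressor and three valid instruments.
rng = np.random.default_rng(1)
T = 2000
Z = rng.standard_normal((T, 3))
v = rng.standard_normal(T)
u = 0.5 * v + rng.standard_normal(T)     # error correlated with the regressor
X = (Z.sum(axis=1) + v)[:, None]
y = 0.5 * X[:, 0] + u
theta_hat, J, dof = j_statistic(y, X, Z)
```

In use, `J` would be compared with a critical value from the $\chi^2$ distribution with `dof` $= q - p$ degrees of freedom. When $q = p$ the sample moments are set exactly to zero, so the statistic is identically zero and there is nothing to test.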


This statistic is known as the “overidentifying restrictions test”.
It has become a standard diagnostic for models estimated by GMM
and is routinely calculated in most computer packages. Its name
reflects the focus of the statistic and it is instructive to investigate
its properties more formally using a local power analysis. To this
end, we introduce the following sequences of local alternatives to
$H_0^I$ and $H_0^O$:


$$H_A^I:\ P\,V^{-1/2} E[f(x_t,\theta_0)] = T^{-1/2} P\eta_I = T^{-1/2}\mu_I$$
$$H_A^O:\ [I_q - P]\,V^{-1/2} E[f(x_t,\theta_0)] = T^{-1/2}[I_q - P]\eta_O = T^{-1/2}\mu_O$$


in which $\mu_I \neq 0$, $\mu_O \neq 0$. Notice that it is always possible to
decompose $V^{-1/2}E[f(x_t,\theta_0)]$ into


$$V^{-1/2}E[f(x_t,\theta_0)] = P\,V^{-1/2}E[f(x_t,\theta_0)] + (I_q - P)\,V^{-1/2}E[f(x_t,\theta_0)]$$ (4-13)


and so, using (4-13), $H_A^I$ and $H_A^O$ translate directly into sequences
of alternatives to $V^{-1/2}E[f(x_t,\theta_0)] = 0$.


First, we verify that $J_T$ has power against violations of the
overidentifying restrictions. It is pedagogically convenient to treat
the two types of violations separately and so we also assume that the
identifying restrictions are satisfied for this part of the discussion,
i.e., the data satisfy $H_0^I \cap H_A^O$. In this case it follows from (4-13)
that

$$\hat V_T^{-1/2} T^{1/2} f_T(\hat\theta_T) \overset{d}{\to} N\left((I_q - P)\eta_O,\ I_q - P\right)$$

and so


$$J_T \overset{A}{\sim} (z_q + \eta_O)'(I_q - P)(z_q + \eta_O) = \chi^2_{q-p}(\mu_O'\mu_O).$$


Since $\mu_O'\mu_O > 0$, $J_T$ has power against this alternative. Now
consider what happens if the overidentifying restrictions are satisfied
but the identifying restrictions are not, i.e., $H_A^I \cap H_0^O$. Under this
sequence of alternatives, it follows from (4-8) and (4-13) that


$$\hat V_T^{-1/2} T^{1/2} f_T(\hat\theta_T) \overset{d}{\to} N\left((I_q - P)P\eta_I,\ I_q - P\right).$$


However, since $(I_q - P)P = 0$, the mean of this limiting distribution
is zero and so $J_T$ converges to a $\chi^2_{q-p}$ distribution. This means the
statistic has the same distribution under $H_0^O$ regardless of whether
$H_0^I$ or $H_A^I$ holds, and so has no power to discriminate between the
latter two hypotheses.
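

The asymmetry just described comes down to the non-centrality parameter $\eta'(I_q - P)\eta$, which vanishes whenever the drift $\eta$ lies in the column space of $M$. The sketch below checks this numerically; the matrix `M`, the drift vectors and the helper name `j_noncentrality` are hypothetical choices made only for illustration.

```python
import numpy as np

def j_noncentrality(M, eta):
    """Non-centrality eta'(I - P)eta of the limiting chi-square of J_T."""
    P = M @ np.linalg.solve(M.T @ M, M.T)
    resid = eta - P @ eta              # (I - P) eta
    return float(resid @ resid)        # I - P is symmetric and idempotent

rng = np.random.default_rng(2)
M = rng.standard_normal((5, 2))        # hypothetical case with q = 5, p = 2

# A drift inside the column space of M: violates only the identifying restrictions.
eta_I = M @ np.array([1.0, -2.0])

# A drift orthogonal to the column space of M: violates only the
# overidentifying restrictions.
w = rng.standard_normal(5)
eta_O = w - M @ np.linalg.solve(M.T @ M, M.T @ w)

nc_identifying = j_noncentrality(M, eta_I)
nc_overidentifying = j_noncentrality(M, eta_O)
```

Up to rounding error `nc_identifying` is zero, so the limiting distribution of $J_T$ is the same central $\chi^2_{q-p}$ as under the null, whereas `nc_overidentifying` equals the squared length of `eta_O`.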


If $H_0^O$ can be rejected then the more general hypothesis $H_0$ can
also be rejected. However, as we have just seen, the failure to reject
$H_0^O$ does not imply the validity of $H_0$ because $H_0^I$ may be violated.
In view of this structure, it is important to understand more about
the consequences of violations of $H_0^I$.⁵ Given their nature, it would
be anticipated that violations of the identifying restrictions affect
the parameter estimator in some way. In fact, using similar analysis
to before, it follows from (4-6) that under $H_A^I \cap H_0^O$


$$T^{1/2}(\hat\theta_T - \theta_0) \overset{d}{\to} N\left(-(M'M)^{-1}M'\eta_I,\ (M'M)^{-1}\right)$$


and so the asymptotic distribution of $\hat\theta_T$ is biased away from $\theta_0$.⁶
It is easily verified using (4-6) that the asymptotic distribution
of $\hat\theta_T$ is unaffected by whether the data satisfy $H_0^O$ or $H_A^O$; so it is
only local violations of the identifying restrictions which cause this
bias. In one sense, these are the types of violation which are of
most concern because they affect inferences about the parameters.
However, it is an inevitable consequence of their imposition during
estimation that they cannot be tested. The only hope is to examine
the overidentifying restrictions for any evidence of model
misspecification, but clearly caution must be exercised in the interpretation
of an insignificant statistic.


5 Many of the points made in this paragraph are qualitatively similar to those
made by Newey [1985a]. However, Newey does not explicitly invoke the
decomposition into identifying and overidentifying restrictions.

6 However, note that the sequence of alternatives is carefully constructed so
that $\hat\theta_T$ is still consistent for $\theta_0$.


If (4-1) can be rejected, then this may be because some or
all of the elements of $E[f(x_t,\theta_0)]$ are nonzero. In many cases
different elements of (4-1) refer to different aspects of the model,
and in some cases there may be a priori information which indicates
the misspecification is confined to certain elements of (4-1). In
these circumstances it may be useful to test whether the data are
consistent with this pattern of misspecification. This is the topic
of the next section.


4.3 Testing Hypotheses About
Subsets of $E[f(x_t,\theta_0)]$


To present methods for testing hypotheses about subsets of the
population moment condition, it is useful to define the following
partitions of $\theta_0$ and $f(.)$. Let $\theta_0' = (\theta_{01}', \theta_{02}')$ where $\theta_{0i}$ is $(p_i \times 1)$,
and $f(x_t,\theta_0)' = [f_1(x_t,\theta_{01})', f_2(x_t,\theta_0)']$ where $f_i(.)$ is $(q_i \times 1)$.

As with the previous section, we begin with a formal statement
of the null and alternative hypotheses, and then consider the extent
to which various statistics achieve this goal. The null hypothesis
is

$$H_0^S:\ E[f_1(x_t,\theta_{01})] = 0 \ \text{ and } \ E[f_2(x_t,\theta_0)] = 0$$
and the alternative is
$$H_A^S:\ E[f_1(x_t,\theta_{01})] = 0 \ \text{ and } \ E[f_2(x_t,\theta_0)] = T^{-1/2}\mu_2,$$


where for convenience we have stated $H_A^S$ as a local alternative.
Two features of this specification should be noted. First,
the veracity of $E[f_1(x_t,\theta_{01})] = 0$ is maintained under both null
and alternative; so the potential misspecification is confined to
$E[f_2(x_t,\theta_0)]$. Second, this framework allows for the possibility that
the maintained moment conditions, $E[f_1(x_t,\theta_{01})] = 0$, only depend
on part of the parameter vector.


Using the decomposition described in Section 4.1, it is seen
that a violation of $E[f_2(x_t,\theta_0)] = 0$ can impact on the identifying
and/or overidentifying restrictions. In fact, all the statistics
proposed for testing $H_0^S$ versus $H_A^S$ are actually based on just one of
these components. We begin by considering tests which focus on
the overidentifying restrictions. To motivate these statistics, it is
useful to briefly abstract to a world in which $\theta_0$ is known. If $f_T(.)$ is
partitioned in the same way as $f(.)$ then it follows from the Central
Limit Theorem that its asymptotic distribution under $H_A^S$ is


$$\begin{bmatrix} T^{1/2} f_{T1}(\theta_{01}) \\ T^{1/2} f_{T2}(\theta_0) \end{bmatrix} \overset{d}{\to} N\left( \begin{bmatrix} 0_{q_1} \\ \mu_2 \end{bmatrix}, \begin{bmatrix} V_{11} & V_{12} \\ V_{21} & V_{22} \end{bmatrix} \right),$$

where $0_{q_1}$ is a $(q_1 \times 1)$ vector of zeros and $V_{ij}$ define the obvious
partition of $V$. This implies the conditional distribution of $T^{1/2} f_{T2}(\theta_0)$
given $T^{1/2} f_{T1}(\theta_{01})$ is also normal and can be most conveniently
expressed as


$$T^{1/2} f_{T2}(\theta_0) - V_{21}V_{11}^{-1} T^{1/2} f_{T1}(\theta_{01}) \overset{d}{\to} N\left(\mu_2,\ V_{22} - V_{21}V_{11}^{-1}V_{12}\right).$$ (4-14)

If $\theta_0$ were known then the statistic on the left hand side of (4-14)
would be the obvious function of the data to use for inference because its
mean is zero under $H_0^S$ and $\mu_2$ under $H_A^S$. However, $\theta_0$ is unknown
in practice, and so some adjustment needs to be made. Newey
[1985a] proposes substituting $\hat\theta_T$ for $\theta_0$ and then basing inference
on

$$n_T = T^{1/2} f_{T2}(\hat\theta_T) - \hat V_{21}\hat V_{11}^{-1} T^{1/2} f_{T1}(\hat\theta_{1T}),$$


where $\hat\theta_T' = (\hat\theta_{1T}', \hat\theta_{2T}')$ has been partitioned conformably with $\theta_0$
and $\hat V_{ij}$ is a consistent estimator of $V_{ij}$. The test statistic is


$$N_T = n_T'\,\hat\Omega_T^{-}\, n_T,$$


where $\hat\Omega_T$ is a consistent estimator of $\Omega = L(I_q - P)L'$,
$L = [-V_{21}V_{11}^{-1},\ I_{q_2}]V^{1/2}$ and $A^{-}$ denotes a generalized inverse of a matrix
$A$.⁷ Newey [1985a] shows that $N_T$ converges in distribution to a $\chi^2_{\nu_1}$
distribution under $H_0^S$ where $\nu_1 = \mathrm{rank}(\Omega)$ and Ahn [1995] shows
that $\nu_1 = q_2 - p_2$. Notice that $\nu_1$ equals the degree to which $\theta_{02}$ is
overidentified by $E[f_2(x_t,\theta_0)] = 0$ if $\theta_{01}$ is known.


In the previous section, it was shown that $T^{1/2} f_T(\hat\theta_T)$ actually
captures the information in the overidentifying restrictions.
Therefore, since $N_T$ is based on this statistic, the implicit null hypothesis
of $N_T$ must involve only the overidentifying restrictions and not the
population moment condition per se. This intuition can be more
formally justified by observing that $n_T$ can be rewritten as


$$n_T = [-\hat V_{21}\hat V_{11}^{-1},\ I_{q_2}]\, T^{1/2} f_T(\hat\theta_T)$$


7 There are many ways of constructing a generalized inverse but not all possess
the property that $\mathrm{plim}_{T\to\infty}\hat\Omega_T^{-} = \Omega^{-}$. Newey [1985a] discusses conditions
under which this property holds and these are assumed to apply here.


and so under both $H_0^S$ and $H_A^S$ it follows that
$$n_T = L[I_q - P]\,V^{-1/2} T^{1/2} f_T(\theta_0) + o_p(1).$$
Therefore, the implicit null hypothesis of the test is
$$\Omega^{-} L[I_q - P]\,V^{-1/2} E[f(x_t,\theta_0)] = 0,$$ (4-15)


which clearly involves a linear combination of the overidentifying
restrictions. Although perhaps not readily apparent from (4-15),
it can be shown after a lot of algebra that this implicit null
hypothesis is violated by any sequence of alternatives in $H_A^S$ which
also violate $H_0^O$; see Ahn [1995]. However, it is clear from (4-15)
that the statistic has no power against alternatives which violate
the identifying restrictions alone.


The form of the statistic, $N_T$, is very useful for developing the
intuition behind the tests but is a little cumbersome to construct
in practice. Fortunately, Ahn [1995] shows this statistic is
asymptotically equivalent to a more convenient statistic which had been
independently proposed by Eichenbaum, Hansen, and Singleton
[1988]. This statistic involves two estimations and is similar in
spirit to the likelihood ratio statistic from maximum likelihood
theory. The first estimation is based on the full set of moment
conditions in (4-1). The second involves the estimation based on
only the population moment conditions which are maintained to be
true under both $H_0^S$ and $H_A^S$. Eichenbaum, Hansen, and Singleton’s
[1988] statistic is then $T$ times the difference in these two estimated
GMM minimands, namely


$$C_T = T\{Q_T(\hat\theta_T) - Q_{1T}(\tilde\theta_{1T})\},$$
where $\tilde\theta_{1T}$ is the value of $\theta_1$ which minimizes


$$Q_{1T}(\theta_1) = f_{T1}(\theta_1)'\,\hat V_{11}^{-1} f_{T1}(\theta_1).$$


The asymptotic equivalence of $C_T$ and $N_T$ holds under both $H_0^S$
and $H_A^S$, and so the previous comments about the implicit null
hypothesis of $N_T$ apply equally to $C_T$.⁸
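

As an illustration of how a difference-in-minimands statistic of this kind might be computed, the sketch below treats a linear model in which the maintained moments use the first `q1` instruments and the remaining instruments are under test. Everything specific here is an assumption made for the example: the helper names, the simulated data, the use of an infeasible estimate of $V$ built from the true errors, and the choice to weight the maintained moments by the corresponding block of that same estimate (which guarantees a non-negative difference in finite samples).

```python
import numpy as np

def min_quadratic(A, b, V):
    """Minimum over th of (b - A'th)' V^{-1} (b - A'th)."""
    Vinv = np.linalg.inv(V)
    th = np.linalg.solve(A @ Vinv @ A.T, A @ Vinv @ b)
    f = b - A.T @ th
    return float(f @ Vinv @ f)

def c_statistic(y, X, Z, q1, V):
    """T times the gap between the full and the maintained GMM minimands."""
    T = len(y)
    A, b = X.T @ Z / T, Z.T @ y / T      # f_T(theta) = b - A' theta
    Q_full = min_quadratic(A, b, V)
    Q_sub = min_quadratic(A[:, :q1], b[:q1], V[:q1, :q1])
    return T * (Q_full - Q_sub)

rng = np.random.default_rng(3)
T = 1000
Z = rng.standard_normal((T, 4))
v = rng.standard_normal(T)
X = (Z.sum(axis=1) + v)[:, None]
u = 0.5 * v + rng.standard_normal(T)
y_valid = 0.5 * X[:, 0] + u              # all four instruments valid
y_bad = y_valid + 1.0 * Z[:, 3]          # fourth instrument now invalid
V = np.var(u) * (Z.T @ Z / T)            # infeasible V, for illustration only

C_valid = c_statistic(y_valid, X, Z, 3, V)
C_bad = c_statistic(y_bad, X, Z, 3, V)
```

Here $p_2 = 0$ and $q_2 = 1$, so the reference distribution has $q_2 - p_2 = 1$ degree of freedom; `C_bad` is far larger than `C_valid` because the fourth moment condition fails.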


8 Newey [1985a] also proposes another asymptotically equivalent version of
these statistics which can be used if $p_2 = 0$. It is similar in construction
to $N_T$ but is evaluated at $\tilde\theta_{1T}$. However, for brevity, we refer the interested
reader to Newey [1985a] for the exact details.


The statistics $N_T$ and $C_T$ offer a method for the detection of
$E[f_2(x_t,\theta_0)] \neq 0$ via its impact on the overidentifying restrictions.
It would clearly be desirable to develop complementary statistics
for testing $H_0^S$ which are sensitive to violations in the identifying
restrictions. In view of our earlier comments, this might seem
impossible, but there is a crucial difference here which makes it
possible under certain conditions. To motivate these conditions,
we briefly reconsider testing $H_0^I$ against $H_A^I$. The problem there
was that we tried to test the information upon which estimation
was based using the estimator obtained from that information;
and this cannot be done because of the fundamental relationship
between the two. However, if $p_2 = 0$ and $q_1 > p$ then we can
use $E[f_1(x_t,\theta_0)] = 0$ to provide an alternative estimator of $\theta_0$, $\tilde\theta_T$
say, which does not satisfy the identifying restrictions in (4-3) by
construction. Inference can then be based on


$$\hat I_T = \tilde P\,\tilde V_T^{-1/2} f_T(\tilde\theta_T),$$ (4-16)


where $\tilde P = \tilde M(\tilde M'\tilde M)^{-1}\tilde M'$, $\tilde M = \tilde V_T^{-1/2} F_T(\tilde\theta_T)$ and $\tilde V_T$ is a
consistent estimator of $V$ based on $\tilde\theta_T$. Intuition suggests that if
$H_0^S$ is satisfied then $\hat I_T$ should be zero allowing for sampling error.
The null hypothesis of such a test is⁹


$$H_0^{SI}:\ P\,V^{-1/2}E[f(x_t,\theta_0)] - M(M_1'M_1)^{-1}M_1'V_{11}^{-1/2}E[f_1(x_t,\theta_0)] = 0,$$


where $M_1 = V_{11}^{-1/2}F_1$ and $F_1 = E[\partial f_1(x_t,\theta_0)/\partial\theta']$. Therefore if
$E[f_1(x_t,\theta_0)] = 0$ then the test is clearly sensitive to alternatives
which satisfy both $H_A^S$ and $H_A^I$. Rather than formally derive a
statistic based directly on $\hat I_T$, it is more convenient to develop a
transformed version which can be conveniently expressed in terms
of the parameter estimators from two separate estimations. To
do this, notice that the projection matrix structure implies the
information in $\hat I_T$ is equivalent to the information in


$$h_T = (\tilde F_T'\tilde V_T^{-1}\tilde F_T)^{-1}\tilde F_T'\tilde V_T^{-1}\, T^{1/2} f_T(\tilde\theta_T),$$
where $\tilde F_T = F_T(\tilde\theta_T)$. Newey [1985a] shows $h_T = T^{1/2}(\tilde\theta_T - \hat\theta_T) + o_p(1)$ under both
$H_0^S$ and $H_A^S$ and so it is possible to test this null hypothesis using the Hausman
[1978] type statistic


$$H_T = T(\tilde\theta_T - \hat\theta_T)'\,\hat V_h^{-}\,(\tilde\theta_T - \hat\theta_T),$$
where $\hat V_h$ is a consistent estimator of
$$V_h = (F_1'V_{11}^{-1}F_1)^{-1} - (F'V^{-1}F)^{-1},$$
with $F = E[F_t(\theta_0)]$.
Newey [1985a] proves that $H_T$ converges to a $\chi^2_{\nu_2}$ distribution under
$H_0^S$ where $\nu_2 = \mathrm{rank}(V_h)$. In most cases, this rank is equal to $p$ and
so the generalized inverse can be replaced by the inverse; however
this need not always be the case.


9 This follows from an application of the Mean Value Theorem to $f_T(\tilde\theta_T)$
around $\theta_0$ in (4-16) and using an analogous expression to (4-7) for $\tilde\theta_T - \theta_0$.


One final comment is in order. All the statistics in this section
are designed to test for violations of $E[f_2(x_t,\theta_0)] = 0$ conditional
on the validity of $E[f_1(x_t,\theta_0)] = 0$. However, if the latter is untrue,
then this may cause the statistic in question to be significant even
though $E[f_2(x_t,\theta_0)]$ is actually zero.


4.4 Testing Hypotheses About the 
Parameter Vector 


It is often the case that a particular aspect of an economic theory 
translates into a set of restrictions on the parameter vector of the 
econometric model. So the veracity of the theory can be assessed 
by testing whether the restrictions in question are satisfied by the 
data. This section describes various methods for performing this 
type of inference. 


We adopt the general framework in which it is desired to test
$$H_0^R:\ r(\theta_0) = 0 \quad \text{versus} \quad H_A^R:\ r(\theta_0) = T^{-1/2}\mu_R,$$


where $r(.)$ is an $(s \times 1)$ vector of continuously differentiable functions
whose derivative matrix is denoted by $R(\theta) = \partial r(\theta)/\partial\theta'$. Once again, we have expressed
the alternative in a local form for brevity; it is assumed that
$\mu_R \neq 0$. The mapping $r: \mathbb{R}^p \to \mathbb{R}^s$ must also satisfy certain other
restrictions. First, the number of restrictions, $s$, cannot exceed
the number of parameters.¹⁰ Second, the restrictions must form a
coherent set of equations so that given the value of $p - s$ elements of
$\theta_0$ it is possible to solve uniquely for the remaining $s$ values using
$r(\theta_0) = 0$. This property is guaranteed by the Implicit Function
Theorem if the rank of $R(\theta_0)$ is $s$; for example, see Apostol [1974,
p. 374].


10 It takes at most $p$ unique restrictions to define the value of the $p$
parameters. So if $s > p$ then either certain restrictions are redundant or the
system $r(\theta_0) = 0$ is inconsistent and hence has no solution for $\theta_0$.


Newey and West [1987] develop the theory for testing this
type of null hypothesis. They propose three main statistics which
can be viewed as extensions to the GMM framework of the Wald,
Likelihood Ratio (LR) and Lagrange Multiplier (LM) tests from
maximum likelihood theory.¹¹ To facilitate the presentation, it is
useful to define unrestricted and restricted estimators of $\theta_0$. The
unrestricted estimator is just $\hat\theta_T$ defined earlier. The restricted
estimator is the value of $\theta$ which minimizes $Q_T(\theta)$ subject to $r(\theta) = 0$;
this is denoted $\tilde\theta_T$. It is assumed that both these minimizations use
the weighting matrix $\hat V_T^{-1}$. We now introduce the three statistics
in turn.


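The Wald test examines whether the unrestricted estimator,
$\hat\theta_T$, satisfies the restrictions with due allowance for sampling error.
The statistic is


$$W_T = T\, r(\hat\theta_T)'\left[R(\hat\theta_T)(\hat M'\hat M)^{-1}R(\hat\theta_T)'\right]^{-1} r(\hat\theta_T).$$


The D or LR-type test examines the impact on the GMM minimand
of the imposition of the restrictions. This statistic is


$$D_T = T\left[Q_T(\tilde\theta_T) - Q_T(\hat\theta_T)\right].$$


Finally, the LM test examines whether the restricted estimator, $\tilde\theta_T$,
satisfies the first order conditions from the unrestricted estimation.
This statistic is


$$LM_T = T\, f_T(\tilde\theta_T)'\,\hat V_T^{-1/2}\,\tilde P\,\hat V_T^{-1/2} f_T(\tilde\theta_T),$$
where $\tilde P = \tilde M(\tilde M'\tilde M)^{-1}\tilde M'$ and $\tilde M = \hat V_T^{-1/2} F_T(\tilde\theta_T)$.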
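

The three definitions can be compared directly in the simplest setting. The sketch below assumes linear sample moments $f_T(\theta) = b - A'\theta$ and linear restrictions $R\theta = r_0$, with the same weighting matrix in both estimations; in that special case the three statistics coincide exactly, which makes a convenient numerical check. The inputs are randomly generated and the function name `wald_d_lm` is hypothetical.

```python
import numpy as np

def wald_d_lm(A, b, V, R, r0, T):
    """Wald, D and LM statistics for H0: R theta = r0.

    Assumes linear sample moments f_T(theta) = b - A' theta (so that
    F_T = -A') and a common weighting matrix V^{-1}.
    """
    Vi = np.linalg.inv(V)
    H = A @ Vi @ A.T                       # F_T' V^{-1} F_T, i.e. M'M
    Hi = np.linalg.inv(H)
    Q = lambda th: (b - A.T @ th) @ Vi @ (b - A.T @ th)

    th_u = Hi @ A @ Vi @ b                 # unrestricted estimator
    gap = R @ th_u - r0                    # r(theta_hat)
    S = np.linalg.inv(R @ Hi @ R.T)
    th_r = th_u - Hi @ R.T @ S @ gap       # restricted estimator

    W = T * gap @ S @ gap
    D = T * (Q(th_r) - Q(th_u))
    score = A @ Vi @ (b - A.T @ th_r)      # -F_T' V^{-1} f_T(theta_tilde)
    LM = T * score @ Hi @ score
    return W, D, LM, th_r

rng = np.random.default_rng(4)
p, q, T = 3, 5, 200
A = rng.standard_normal((p, q))
b = rng.standard_normal(q)
G = rng.standard_normal((q, q))
V = G @ G.T + q * np.eye(q)                # positive definite by construction
R = np.array([[1.0, -1.0, 0.0]])           # H0: theta_1 = theta_2
r0 = np.zeros(1)
W, D, LM, theta_r = wald_d_lm(A, b, V, R, r0, T)
```

With nonlinear moments or nonlinear restrictions the three numbers generally differ in finite samples, and the restricted estimator has to be found by constrained numerical optimization rather than by the closed form used here.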


Newey and West [1987] show that all three statistics are
asymptotically equivalent under both $H_0^R$ and $H_A^R$: under $H_0^R$ the
statistics converge to a $\chi^2_s$ distribution and under $H_A^R$ to a $\chi^2_s(\delta_R)$
where


$$\delta_R = \mu_R'\left[R(\theta_0)(M'M)^{-1}R(\theta_0)'\right]^{-1}\mu_R > 0.$$


So the statistics have power against the alternative for which they
are designed. In view of their equivalence, some other criteria
must be used to choose between the three. One such criterion
is computational burden, although this is less of a concern now
than it once was. The D statistic is more burdensome because it
requires two estimations, whereas the Wald and LM only require
one. Sometimes the unrestricted estimation is easier and sometimes
not; it all depends on the model in question and the nature of $r(.)$.
However, the Wald test has two disadvantages which should be
mentioned. First, it is not invariant to a reparameterization of the
model or the restrictions. This means that it is possible to rewrite
the model and restrictions in a logically consistent way, but end up
with a different Wald statistic.¹² Neither of the other two tests has
this problem.¹³ The second disadvantage is that the Wald statistic
tends to be less well approximated by the $\chi^2$ distribution in finite
samples than the other two statistics; see, for example, the simulation
evidence reported in Gallant [1987].


11 Newey and West [1987] also analyze a minimum Chi-squared statistic, but
this is rarely used and so is omitted here.


It is possible to re-express this hypothesis in terms of the
identifying restrictions associated with GMM estimation and thereby
gain useful insights into the interpretation of a significant statistic.
To set up this analysis it is useful to return to the identifying
restrictions for the unrestricted estimation given in (4-4). Sowell
[1996a] observes that these restrictions state that the projection
of $V^{-1/2}E[f(x_t,\theta_0)]$ onto the column space of $M$, $C(M)$, is zero.
We now show that $H_0^R$ can be interpreted as a hypothesis about a
sub-space of $C(M)$. To do this notice that if $H_0^R$ is true then the
Implicit Function Theorem implies that the population moment
condition can be written as


$$E[f(x_t, g(\psi_0))] = 0$$ (4-17)


where $\psi_0$ is a $(p - s)$ vector which satisfies $\theta_0 = g(\psi_0)$. Now, if (4-17)
is treated as a basis for GMM estimation of $\psi_0$ then the associated
identifying restrictions imply the projection of $V^{-1/2}E[f(x_t,\theta_0)]$
onto the column space of $M_\psi = V^{-1/2}E[\partial f(x_t, g(\psi_0))/\partial\psi']$ is zero.
However, since
$$M_\psi = M\{\partial g(\psi_0)/\partial\psi'\},$$

it follows that the column space of $M_\psi$, $C(M_\psi)$, is of dimension
$p - s$ and $C(M_\psi) \subset C(M)$.


12 For example the restriction $\theta_{0i} = 0$ can also be rewritten as $\theta_{0i}^k = 0$ for any
finite positive integer $k$.

13 See Davidson and MacKinnon [1993, pp. 467-469] for a useful discussion of this
issue and some examples.


This interpretation is useful because it emphasizes an
important aspect of $H_0^R$: it is a hypothesis about the projection of
$V^{-1/2}E[f(x_t,\theta_0)]$ onto $C(M_\psi)$ conditional on its projection onto
$C(M)$ being zero. So the hypothesis may be rejected because the
identifying restrictions are satisfied at $\theta_0$ but $r(\theta_0) \neq 0$, or it
may be rejected when $r(\theta_0) = 0$ because the identifying restrictions
are not satisfied at the true parameter value, $\theta_0$. This emphasizes
the importance of using the tests in Sections 4.2 and 4.3 to assess
the model specification before performing inference about the
parameters. It also highlights that $W_T$, $D_T$ and $LM_T$ are functions of
the identifying restrictions and so from (4-10) are asymptotically
independent of the previous statistics based on the overidentifying
restrictions under $H_0$ or any of the local alternatives considered
above.


4.5 Testing Hypotheses About 
Structural Stability 


So far, it has been assumed that if (4-1) is violated then the value of
$E[f(x_t,\theta_0)]$ is the same for all $t$. This property is referred to as
structural stability. However, (4-1) is also violated if $E[f(x_t,\theta_0)] \neq 0$ for
only part of the sample; such behavior is termed structural
instability. This section reviews various methods for testing structural
stability based on GMM estimators.


The null hypothesis for structural stability tests is very simple:
it states that some aspect of the model has remained constant over
the sample. The alternative is more difficult, however, because
it must specify how the model changes. In the GMM literature,
attention has focused almost exclusively on the case where the
instability involves a discrete change at a single point in the sample
known as the “breakpoint”.¹⁴ To present the null and alternative
hypotheses, it is necessary to introduce the following notation. Let
$\pi$ be a constant defined on $(0,1)$ and let $\pi T$ denote the potential
breakpoint at which some aspect of the model changes. For our
purposes here, it is convenient to divide the original sample into
two sub-samples. Sub-sample 1 consists of the observations before
the breakpoint, namely $T_1 = \{1,2,\ldots,[\pi T]\}$, where $[.]$ denotes the
integer part, and sub-sample 2 consists of the observations after the
breakpoint, $T_2 = \{[\pi T]+1,\ldots,T\}$. This breakpoint may be treated
as known or unknown in the construction of the tests. If it is known,
then the breakpoint is specified a priori by the researcher and it
is only desired to test for instability at this point alone.¹⁵ If the
breakpoint is unknown, then the null is the broader hypothesis that
there is no instability at any point in the sample. It is easily
imagined that tests for the two cases are closely related. We begin our
discussion with the simpler case in which the breakpoint is known
because this provides a more convenient setting for introducing the
statistics and contrasting their properties. We then consider the
extension of these techniques to the unknown breakpoint case.


14 Sowell [1996a] provides a general framework for considering the design of
tests for structural stability. We briefly discuss other forms of instability at
the end of this section.


Andrews and Fair [1988] propose tests for “parameter
variation”. Their statistics are most easily presented by introducing the
augmented population moment condition:


$$E\begin{bmatrix} d_t(\pi)\, f(x_t,\theta_1) \\ (1 - d_t(\pi))\, f(x_t,\theta_2) \end{bmatrix} = 0,$$ (4-18)

where $d_t(\pi)$ is a dummy variable which equals one when $t \leq \pi T$ and zero otherwise, and
$\phi' = (\theta_1', \theta_2')$. This population moment condition is more general
than (4-1) because it allows for the possibility that $E[f(x_t,\theta)] = 0$
is satisfied at different parameter values before and after the
breakpoint. However, it also contains (4-1) as a special case because
if $\theta_1 = \theta_2$ then the moment condition is satisfied at the same
parameter value throughout the sample. This restriction is most
easily tested by using (4-18) as a basis for GMM estimation of $\phi$
and then calculating the Wald, LR- or LM-type statistics from the
previous section for the hypothesis $[I_p, -I_p]\phi_0 = 0$. Whichever
version is chosen, it has a limiting $\chi^2_p$ distribution if the restriction
holds. For example, this route yields the Wald statistic


$$W_T(\pi) = T\left(\hat\theta_1(\pi) - \hat\theta_2(\pi)\right)'\hat V_W(\pi)^{-1}\left(\hat\theta_1(\pi) - \hat\theta_2(\pi)\right),$$
where $\hat\theta_i(\pi)$ is the two step GMM estimator based on the
sub-sample $T_i$,

$$\hat V_W(\pi) = [\hat M_1(\pi)'\hat M_1(\pi)]^{-1}/\pi + [\hat M_2(\pi)'\hat M_2(\pi)]^{-1}/(1 - \pi),$$
$\hat M_i(\pi) = \hat V_i(\pi)^{-1/2}\, T_i^{-1}\sum_{t \in T_i} \partial f(x_t,\hat\theta_i(\pi))/\partial\theta'$ (with $T_i$ in the normalization denoting the number of observations in sub-sample $i$) and $\hat V_i(\pi)$ is a consistent
estimator of $V$ based on $T_i$.

15 For example, Ghysels and Hall [1990a] investigate whether the change in
operating procedures by the Federal Reserve in October 1979 caused instability
in certain asset pricing models.


Ghysels and Hall [1990b] take a different approach. They focus
on testing the null hypothesis


$$H_0:\ E[f(x_t,\theta_0)] = 0, \quad t \in T_1 \cup T_2$$


against the (local) alternative
$$H_A:\ E[f(x_t,\theta_0)] = 0, \quad t \in T_1, \quad \text{and}$$
$$E[f(x_t,\theta_0)] = T^{-1/2}\mu_2, \quad t \in T_2,$$


where $\mu_2 \neq 0$. Their statistic is based on evaluating the sample
moments from $T_2$ at $\hat\theta_1(\pi)$. Under $H_0$, this estimated sample
moment should converge in probability to zero. This approach leads
to the Predictive test statistic


$$PR_T(\pi) = \left[T^{-1/2}\sum_{t \in T_2} f(x_t,\hat\theta_1(\pi))\right]'\hat V_{PR}^{-1}\left[T^{-1/2}\sum_{t \in T_2} f(x_t,\hat\theta_1(\pi))\right],$$


where $\hat V_{PR}$ is a covariance matrix defined in Ghysels and Hall
[1990b]. Ghysels and Hall [1990b] show that this statistic converges
to a $\chi^2_q$ distribution under $H_0$.¹⁶ This limiting distribution has more
degrees of freedom than that of $W_T(\pi)$ and so $PR_T(\pi)$ is clearly testing
something more than parameter variation.


The difference between the two types of tests is best under- 
stood by recasting the hypotheses in terms of the identifying and 
overidentifying restrictions. Since the identifying restrictions are 
imposed in estimation, there are always parameter values which 
satisfy them in each of the two sub-samples. So the identifying
restrictions are structurally stable if they are satisfied by the same
parameter value in each sub-sample. This null is formally stated
as

$$\begin{aligned}
H_0^I(\pi): \quad & PV^{-1/2}E[f(x_t,\theta_0)] = 0, && t \in \mathcal{T}_1 \\
 & PV^{-1/2}E[f(x_t,\theta_0)] = 0, && t \in \mathcal{T}_2.
\end{aligned}$$


The overidentifying restrictions are stable if they hold before and
after the breakpoint. This is formally stated as
$H_0^O(\pi) = H_0^{O1}(\pi) \cap H_0^{O2}(\pi)$ where

$$\begin{aligned}
H_0^{O1}(\pi): \quad & [I_q - P_1]V_1^{-1/2}E[f(x_t,\theta_1)] = 0, && t \in \mathcal{T}_1 \\
H_0^{O2}(\pi): \quad & [I_q - P_2]V_2^{-1/2}E[f(x_t,\theta_2)] = 0, && t \in \mathcal{T}_2,
\end{aligned}$$

where $P_i = M_i(M_i'M_i)^{-1}M_i'$, $M_i = V_i^{-1/2}E[F_t(\theta_i)]$ and
$V_i$ is the analog to $V$ defined on $\mathcal{T}_i$ for $i = 1,2$. Notice
that $H_0^{O1}(\pi)$ and $H_0^{O2}(\pi)$ allow for the possibility that the
overidentifying restrictions are satisfied at different values. By the
very nature of the


16 Ghysels and Hall [1990a] propose a structural stability test based along a 
similar principle to the Eichenbaum, Hansen, and Singleton [1988] statistic in 
the previous section, but Ahn [1995] shows this is asymptotically equivalent 
to the Predictive test; however their finite sample properties may be different.

decomposition, it is clear that any instability must be reflected in
a violation of at least one of the three hypotheses: $H_0^I(\pi)$,
$H_0^{O1}(\pi)$ or $H_0^{O2}(\pi)$. Therefore these three hypotheses
provide a framework for contrasting the properties of the above tests.
It is easily seen that $H_0^I(\pi)$ is equivalent to the null hypothesis
of no parameter variation and so the Wald, LR- or LM-type statistics are
testing this hypothesis. However, the orthogonality between the
identifying and overidentifying restrictions means that these statistics
have no power against local alternatives to either $H_0^{O1}(\pi)$ or
$H_0^{O2}(\pi)$. In contrast, Ghysels, Guay, and Hall [1997] show that the
Predictive test has power against local alternatives to both
$H_0^I(\pi)$ and $H_0^{O2}(\pi)$, and this accounts for the difference in
the degrees of freedom mentioned above.


Since instability can manifest itself in a violation of any one of
these three components, it is clear that both the parameter variation
and Predictive tests have important weaknesses as general
diagnostics for structural instability. To address these problems,
Hall and Sen [1996a] propose testing the stability of the identifying
and overidentifying restrictions separately. As we have seen,
the Wald test¹⁷ can be used to test $H_0^I(\pi)$. The stability of the
overidentifying restrictions can be tested using


$$O_T(\pi) = O1_T(\pi) + O2_T(\pi),$$

where $O1_T(\pi)$ and $O2_T(\pi)$ are the overidentifying restrictions tests
based on the sub-samples $\mathcal{T}_1$ and $\mathcal{T}_2$ respectively.
Hall and Sen [1996a] show that $O_T(\pi)$ converges to a
$\chi^2_{2(q-p)}$ distribution under this null hypothesis. The
orthogonality of the decomposition ensures the tests have some useful
properties. Hall and Sen [1996a] show that $W_T(\pi)$ has power against
local alternatives to $H_0^I(\pi)$, denoted $H_1^I(\pi)$, but none against
local alternatives to $H_0^O(\pi)$, denoted $H_1^O(\pi)$. In contrast,
$O_T(\pi)$ has power against $H_1^O(\pi)$ but none against $H_1^I(\pi)$.
Furthermore these two statistics are asymptotically independent and so
can be used together to diagnose the source of the instability. This
approach clearly remedies the weakness of using the Wald test alone or
the Predictive test. Hall and Sen [1996a] show that it has the added
advantage of allowing the researcher to distinguish between two
scenarios of interest. The first is one in which the instability is
confined to the parameters alone; this case


17 The LR- or LM-type statistics can also be used without affecting any of
the following discussion.

is consistent with a violation of $H_0^I(\pi)$ but the validity of $H_0^O(\pi)$.
The second scenario is one in which the instability is not confined
to the parameters alone but affects other aspects of the model; this
would imply a violation of $H_0^O(\pi)$ and most likely $H_0^I(\pi)$ as well.
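
The way the two statistics might be read together can be sketched as a small decision rule. This is only an illustration of the two scenarios described above: the function name, the returned labels and the critical values in the usage line are assumptions (the value 5.991 is the 5% critical value of a chi-squared distribution with two degrees of freedom, which would apply if, say, $p = 2$ and $q - p = 1$).

```python
def diagnose_instability(wald, overid, crit_wald, crit_overid):
    """Classify the source of instability from the Wald statistic (identifying
    restrictions) and the sum of sub-sample overidentifying restrictions tests.
    Hypothetical helper; critical values are supplied by the caller."""
    reject_identifying = wald > crit_wald
    reject_overidentifying = overid > crit_overid
    if reject_overidentifying:
        # Second scenario: instability affects more than the parameters.
        return "instability not confined to the parameters"
    if reject_identifying:
        # First scenario: only the parameter values appear to have changed.
        return "instability confined to the parameters"
    return "no evidence of instability"

# Assumed example: both statistics compared with the chi-squared(2) 5% value.
label = diagnose_instability(12.0, 3.0, crit_wald=5.991, crit_overid=5.991)
```

Because the two statistics are asymptotically independent, the two comparisons can be made separately without one affecting the limiting behaviour of the other.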


We now consider the extension of these methods to the unknown
breakpoint case. Once again, it is beneficial to treat the
hypotheses about the stability of the identifying and overidentifying
restrictions separately.¹⁸ So we begin by describing statistics
for testing the composite null hypothesis that $H_0^I(\pi)$ holds for all
$\pi \in \Pi \subset (0,1)$.¹⁹ The construction is a natural extension of
the fixed breakpoint methods. Now $W_T(\pi)$ is calculated for each
possible $\pi$ to produce a sequence of statistics indexed by $\pi$, and
inference is based on some function of this sequence. This function
is chosen to maximize power against a local alternative in which a
weighting distribution is used to indicate the relative importance
of departures from $H_0^I(\pi)$ in different directions at different
breakpoints. A general framework for the derivation of these optimal
tests is provided by Andrews and Ploberger [1994] in the context of
maximum likelihood estimators and this is generalized to the GMM
framework by Sowell [1996a]. One drawback with this approach is
that a different choice of weighting distribution leads to a different
optimal statistic; however, three choices have received particular
attention. To facilitate their presentation, we define the following
local alternative to $H_0^I(\pi)$,


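$$\begin{aligned}
H_1^I(\pi): \quad & PV^{-1/2}E[f(x_t,\theta_0)] = T^{-1/2}\mu_{I1}, && t \in \mathcal{T}_1 \\
 & PV^{-1/2}E[f(x_t,\theta_0)] = T^{-1/2}\mu_{I2}, && t \in \mathcal{T}_2.
\end{aligned}$$

It is assumed that $\mu_{I1} = 0$²⁰ and a weighting distribution is specified
for $(\mu_{I2}, \pi)$. If the conditional distribution of $\mu_{I2}$ given $\pi$ is
of the form $rL(\pi)U$ where $r$ is a scalar, $L(\pi)$ is a particular matrix
and $U$ is the uniform distribution on the unit sphere in $R^p$ then Andrews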


18 Ghysels, Guay, and Hall [1997] develop similar extensions for the Predictive
test but in view of our previous comments about its weaknesses, we do not
describe these here.

19 In applications it is desirable for $\Pi$ to be as wide as possible, but at the
same time not so wide that asymptotic theory is a poor approximation in
the sub-samples. For example, in applications to models of economic time
series, it has become customary to set $\Pi = (0.15, 0.85)$.

20 For these tests of parameter variation, the roles of $\mu_{I1}$, $\mu_{I2}$ can be
interchanged.

and Ploberger [1995] show that for $r$ sufficiently large the optimal
statistic is


$$SupW_T = \sup_{\pi\in\Pi}\{W_T(\pi)\}.$$

This statistic had been proposed earlier by Andrews [1993] who
derives and tabulates the non-standard limiting distribution. The
other two choices arise from a specification for the conditional
distribution of $\mu_{I2}$ given $\pi$ as $N(0, c\Sigma_\pi)$, for some
constant $c$. Andrews and Ploberger [1994] and Sowell [1996a] show that
for a particular choice of $\Sigma_\pi$ the optimal statistic only depends
on $c$ and not $\Sigma_\pi$. So for convenience this choice is made and
then attention has focussed on two values of $c$. If $c = 0$ then the
optimal statistic takes the form


$$AvW_T = \int_\Pi W_T(\pi)\,dJ(\pi),$$

where $J(\pi)$ defines the weighting distribution for the breakpoint
over $\Pi$. If $c = \infty$ then the optimal statistic takes the form

$$ExpW_T = \log\left\{\int_\Pi \exp[0.5\,W_T(\pi)]\,dJ(\pi)\right\}.$$

Andrews and Ploberger [1994] derive and tabulate the non-standard
limiting distributions of $AvW_T$ and $ExpW_T$ under the
assumption that $J(\pi)$ is a uniform distribution on $\Pi$.²¹
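
Given a sequence of Wald statistics already computed at each candidate breakpoint, the three functionals can be sketched as follows. The sketch assumes that the candidate breakpoints are equally spaced over the trimmed interval and that the weighting distribution is uniform, so each integral is approximated by a simple average; the function names are illustrative and not taken from the text.

```python
import math

def sup_stat(w):
    """Sup functional: the largest statistic over the candidate breakpoints."""
    return max(w)

def ave_stat(w):
    """Average functional: the integral against a uniform weighting
    distribution, approximated by an equally weighted mean (assumption)."""
    return sum(w) / len(w)

def exp_stat(w):
    """Exponential functional: log of the mean of exp(0.5 * W), computed
    relative to the maximum to avoid overflow for large statistics."""
    m = max(w)
    mean_scaled = sum(math.exp(0.5 * (x - m)) for x in w) / len(w)
    return 0.5 * m + math.log(mean_scaled)

# Assumed sequence of statistics, one per candidate breakpoint.
wald_sequence = [1.3, 2.8, 7.4, 3.1, 0.9]
summary = (sup_stat(wald_sequence), ave_stat(wald_sequence), exp_stat(wald_sequence))
```

None of these functionals has a chi-squared limit, so the resulting values would have to be compared with the tabulated non-standard critical values rather than with conventional tables.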


The same ideas can be used to construct tests of the null
hypothesis that $H_0^O(\pi)$ holds for all $\pi \in \Pi$. However, this time
the exact nature of the alternative is crucial. To show how, we
must specify the following sequences of alternatives:

$$\begin{aligned}
H_1^{O1}(\pi): \quad & [I_q - P_1]V_1^{-1/2}E[f(x_t,\theta_1)] = T^{-1/2}\mu_{O1}, && t \in \mathcal{T}_1 \\
H_1^{O2}(\pi): \quad & [I_q - P_2]V_2^{-1/2}E[f(x_t,\theta_2)] = T^{-1/2}\mu_{O2}, && t \in \mathcal{T}_2.
\end{aligned}$$

Sowell [1996b] derives optimal tests for testing this null against
an alternative in which the overidentifying restrictions are violated
before but not after the breakpoint, i.e.,
$H_1^{O1}(\pi) \cap H_0^{O2}(\pi)$, $\pi \in \Pi$.
These statistics are based on the forward moving partial sums,
$G_1 = T^{-1}\sum_{t\in\mathcal{T}_1} f(x_t,\hat\theta_T)$. Using similar kinds
of weighting distributions on $\mu_{O1}$ to those described above, Sowell
[1996b] derives statistics which involve the same types of functional.
However, Hall and Sen [1996b] show that these statistics are sub-optimal
against


21 Hansen [1997] reports response surfaces which can be used to calculate
p-values for all three versions of these tests.

an alternative in which the overidentifying restrictions are violated
after instead of before the breakpoint, i.e.,
$H_0^{O1}(\pi) \cap H_1^{O2}(\pi)$. Hall and Sen [1996b] derive the optimal
tests in this case and they involve similar functionals but are based on
the backward moving partial sums
$G_2 = T^{-1}\sum_{t\in\mathcal{T}_2} f(x_t,\hat\theta_T)$. Typically, a
researcher does not wish to pick either of these particular alternatives
but is interested in the more general alternative in which the
overidentifying restrictions may be violated before or after the
breakpoint, i.e.,
$\{H_1^{O1}(\pi) \cap H_0^{O2}(\pi)\} \cup \{H_0^{O1}(\pi) \cap H_1^{O2}(\pi)\}$,
$\pi \in \Pi$. Hall and Sen [1996a] propose testing this hypothesis using
the statistics

$$\begin{aligned}
SupO_T &= \sup_{\pi\in\Pi}\{O_T(\pi)\}, \\
AvO_T &= \int_\Pi O_T(\pi)\,dJ(\pi), \\
ExpO_T &= \log\left\{\int_\Pi \exp[0.5\,O_T(\pi)]\,dJ(\pi)\right\},
\end{aligned}$$

and derive their limiting distributions under $H_0^O(\pi)$, $\pi \in \Pi$; they
also tabulate these distributions for the case where $J(\pi)$ is a uniform
distribution.²² Hall and Sen [1997a] simulate the distributions of all
the above statistics for testing the stability of the overidentifying
restrictions. This evidence indicates that the statistics based on
$O_T(\pi)$ are more powerful unless there is strong a priori information
about the side of the breakpoint on which the violation occurs.


Although these statistics are designed to test for a discrete
change at a single point in the sample, they can all be shown to
have non-trivial power against other forms of instability. One other
specific case is briefly worth mentioning; this is where the true value
of the parameter changes continually via a martingale difference
sequence. In this case, Hansen [1990] shows that the LM statistic
for parameter constancy is well approximated by $AvW_T$ and so this
statistic is likely to have good power properties against this
alternative as well.²³


Finally, it should be noted that all these procedures rely on 
asymptotically large samples and so are unlikely to have good power 
properties against instability at the very beginning or end of the 


22 Hall and Sen [1997b] provide response surfaces for calculating p-values for
all the unknown breakpoint versions of the test based on $O_T(\pi)$.

23 Hansen [1990] also proposes a variation on this statistic. His analysis is
motivated by earlier work due to Nyblom [1989] in the context of maximum
likelihood estimators.

sample. Dufour, Ghysels, and Hall [1994] propose a Generalized
Predictive test which can be applied in this situation. The null
and alternative hypotheses are the same as for the Predictive test
except this time only $\mathcal{T}_1$ need be asymptotically large and
$\mathcal{T}_2$ may be as small as one observation. The statistic is based on
$\{f(x_t,\hat\theta_1(\pi)),\ t \in \mathcal{T}_2\}$ and not the sub-sample
average. Since the focus is now the individual observations, it is not
possible to use a conventional asymptotic analysis to deduce the
distribution. One solution is to make a distributional assumption, but
this is unattractive in most GMM settings. Therefore Dufour, Ghysels,
and Hall [1994] consider various distribution free methods of
approximating or bounding the p-value of their statistics.


4.6 Testing Non-nested Hypotheses


So far, we have concentrated on methods for testing hypotheses 
about population moment conditions or parameters within a par- 
ticular model. However, in many cases more than one model has 
been advanced to explain a particular economic phenomenon and so 
it may become necessary to choose between them. Sometimes, one 
model is nested within the other in the sense that it can be obtained 
by imposing certain parameter restrictions. In this case the choice 
between them amounts to testing whether the data support the 
restrictions in question using the methods described in Section 4.4. 
Other times, one model is not a special case of the other and so they
are said to be non-nested. There have been two main approaches
to developing tests between non-nested models. One is based on
creating a more general model which nests both candidate models
as special cases; the other examines whether one model is capable
of explaining the results in the other. Most of this literature has
focused on regression models or models estimated by maximum
likelihood. While these situations technically fall within the GMM
framework, they do not possess its distinctive features and so are
not covered here.²⁴ Instead, we focus on methods for discriminating
between two non-nested Euler equation models. These models
involve partially specified systems and so involve aspects unique
to the GMM in its most general form.


24 These techniques are well described in the recent comprehensive review by 
Gourieroux and Monfort [1994].

We consider the case where there are two competing models
denoted M1 and M2. If M1 is true then the parameter vector
$\theta_1$ and the data satisfy the Euler equation

$$E_1[u_1(x_t,\theta_1)\,|\,\Omega_{t-1}] = 0, \tag{4-19}$$

where $\Omega_{t-1}$ is the information available at time $t-1$ and $E_1[\,.\,]$
denotes expectations under the assumption M1 is correct. For
our purposes, it is sufficient to assume the Euler equation residual
$u_1(x_t,\theta_1)$ is a scalar. From (4-19) it follows that the residual is
orthogonal to any $(q_1 \times 1)$ vector $z_{1t} \in \Omega_{t-1}$, and this
yields the population moment condition

$$E_1[z_{1t}u_1(x_t,\theta_1)] = 0. \tag{4-20}$$

Using analogous definitions, M2 leads to the $(q_2 \times 1)$ population
moment condition

$$E_2[z_{2t}u_2(x_t,\theta_2)] = 0, \tag{4-21}$$

where again the Euler equation residual is taken to be a scalar.
It is assumed that the two models are globally non-nested in the
sense that one model is not a special case of the other.²⁵ Since both
models can be subjected to the tests in Sections 4.2-4.4, there can
only be a need to discriminate between them if both models pass
all these diagnostics; so we assume this is the case.


As mentioned above there are two main strategies to developing 
non-nested hypothesis tests and each has been applied within the 
context of Euler equation models. Singleton [1985] proposes nesting 
the Euler equations of M1 and M2 within the Euler equation of 
a more general model. Ghysels and Hall [1990c] propose tests of 
whether one model can explain the results in another. We now 
describe these in turn. 


Singleton’s [1985] analysis begins with the observation that if 
M1 is false and its overidentifying restrictions test is insignificant 
then it must be because the test has poor power properties when 
M2 is true. Therefore, he proposes choosing the linear combination 
of the overidentifying restrictions which has the most power in the 
direction of M2. The problem is how to characterize this direction. 
Singleton [1985] solves this issue by introducing a more general 


25 See Pesaran [1987] for a formal definition of nested, partially non-nested 
and globally non-nested models. The distinction between the last two can 
be important but need not concern us here.

Euler condition which is the following convex combination of those
from M1 and M2,

$$E_G[e_t(\theta_1,\theta_2,\omega)\,|\,\Omega_{t-1}] = 0, \tag{4-22}$$

where

$$e_t(\theta_1,\theta_2,\omega) = \omega\,u_1(x_t,\theta_1) + (1-\omega)\,u_2(x_t,\theta_2),$$

$0 \le \omega \le 1$ and $E_G[\,.\,]$ is taken with respect to the true
distribution of the data under this more general model. Notice that
$\omega = 1$ implies M1 is correct, and $\omega = 0$ implies M2 is correct. The
other values of $\omega$ imply a continuum of residual processes which
lie between those implied by M1 and M2 in some sense. If $\omega$ is
replaced by a suitably defined sequence $\omega_T$ which converges to one
from below at rate $T^{1/2}$ and $z_{1t} = z_{2t} = z_t$, then

$$E_G[z_t e_t(\theta_1,\theta_2,\omega_T)] = 0$$

defines a sequence of local alternatives to (4-20) in the direction
of (4-21). Singleton [1985] shows that the linear combination of the
overidentifying restrictions in M1 which maximizes power against
this local alternative is


$$\hat A_T = \hat V_{1T}^{-1}\left(f_{1T}(\hat\theta_{1T}) - f_{2T}(\hat\theta_{2T})\right),$$

where $f_{iT}(\theta) = T^{-1}\sum_{t=1}^{T} z_t u_i(x_t,\theta)$, $\hat V_{1T}$ is a
consistent estimator of $\lim_{T\to\infty} Var[T^{1/2}f_{1T}(\theta_1)]$ and
$\hat\theta_{iT}$ is the GMM estimator of $\theta_i$. This leads to the test
statistic

$$NN_T(1,2) = T\,f_{1T}(\hat\theta_{1T})'\hat A_T\,(\hat A_T'\hat\Sigma_{1T}\hat A_T)^{-1}\hat A_T' f_{1T}(\hat\theta_{1T}),$$

where

$$\hat\Sigma_{1T} = \hat V_{1T} - \hat F_{1T}(\hat F_{1T}'\hat V_{1T}^{-1}\hat F_{1T})^{-1}\hat F_{1T}' \quad \text{and} \quad \hat F_{1T} = \partial f_{1T}(\hat\theta_{1T})/\partial\theta'.$$

Singleton [1985] shows that if M1 is correct then $NN_T(1,2)$ converges
to a $\chi^2_1$ distribution. The roles of M1 and M2 can be
reversed to produce the analogous statistic $NN_T(2,1)$ which would
be asymptotically $\chi^2_1$ if M2 is correct. In fact, the test should
be performed both ways and so there are four possible outcomes:
$NN_T(1,2)$ is significant but $NN_T(2,1)$ is not and so M2 is chosen;
$NN_T(2,1)$ is significant but $NN_T(1,2)$ is not and so M1 is chosen;
both $NN_T(1,2)$ and $NN_T(2,1)$ are significant and so both models
can be rejected; both $NN_T(1,2)$ and $NN_T(2,1)$ are insignificant
and so it is not possible to choose between them in this way.
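
The four outcomes of running the test in both directions can be written down as a small decision rule. The sketch below only restates the logic of the paragraph above; the function name and the returned labels are illustrative assumptions, and the two significance flags are taken as given.

```python
def choose_between_models(nn12_significant, nn21_significant):
    """Map the outcomes of the two non-nested tests to a decision.
    nn12_significant: the statistic with M1 as the null is significant.
    nn21_significant: the statistic with M2 as the null is significant."""
    if nn12_significant and not nn21_significant:
        return "choose M2"          # M1 is rejected in the direction of M2
    if nn21_significant and not nn12_significant:
        return "choose M1"          # M2 is rejected in the direction of M1
    if nn12_significant and nn21_significant:
        return "reject both"        # neither model survives
    return "cannot discriminate"    # neither test is significant

# Assumed example: only the test of M1 against M2 is significant.
decision = choose_between_models(True, False)
```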
This approach is relatively simple to implement because it does
not require any additional assumptions or computations beyond
those already involved for the estimation of M1 and M2. Its
weakness is that the convex combination of the Euler equations
from M1 and M2 may not be the Euler equation of a well defined
economic model.²⁶ In such cases, it is unclear how a significant
statistic should be interpreted. The only way to avoid this problem
is to consider sequences of local alternatives to the data generation
process implied by M1 which are in the direction of the data
generation process implied by M2. However, this involves making
the type of distributional assumption which the use of GMM was
designed to avoid.


Ghysels and Hall [1990c] propose an alternative approach to
testing based on whether one model can explain the results in the
other.²⁷ More specifically, the data are said to support M1 if

$$T^{-1}\sum_{t=1}^{T} z_{2t}u_2(x_t,\hat\theta_{2T}) - E_1[z_{2t}u_2(x_t,\hat\theta_{2T})] \tag{4-23}$$

is zero allowing for sampling error. To implement the test it is
necessary to know or be able to estimate the expectation term
in (4-23). Unfortunately, this will typically involve specifying the
conditional distribution of $x_t$ and so is unattractive for the reason
mentioned above.²⁸ Ghysels and Hall [1990c] develop a test based
on approximating the expectation using quadrature based methods,
but we omit the details here.


Both these statistics are clearly focusing on the overidentify- 
ing restrictions alone. It is possible to extend Ghysels and Hall’s 
[1990c] approach to tests of whether M1 can explain the identifying 
restrictions in M2. Such a test would focus on whether the solution 
to the identifying restrictions in M2 is equal to the value predicted 
by M1. In other words, it would examine 


$$\hat\theta_{2T} - E_1[\hat\theta_{2T}].$$


26 For example, Ghysels and Hall [1990c] show that a model constructed by
taking a convex combination of the data generating processes for $x_t$ implied
by M1 and M2 does not typically possess an Euler equation of the form in
(4-21).

27 This general approach is often referred to as the encompassing test principle;
see Mizon and Richard [1986].

28 Furthermore Ghysels and Hall [1990c] show that a misspecification of this
distribution can cause their statistic to be significant.

However, it would suffer from the same drawbacks as mentioned
above and so we do not pursue such a test here.


Neither of these approaches is really satisfactory. Singleton’s
[1985] test is only really appropriate in the limited setting where
(4-21) is the Euler condition of a meaningful model. Ghysels and
Hall’s [1990c] test is always appropriate but requires additional
assumptions about the distribution, and once these are made, it
is more efficient to use maximum likelihood estimation.²⁹ This
contrasts with the more successful treatments of the hypotheses
in Sections 4.2-4.5. In these earlier cases, the partial specification
caused no problems, but it clearly does for non-nested hypotheses.
In one sense, these results are more important because they
illustrate the potential limits to inference based on a partially
specified model.


4.7 Conditional Moment and Hausman Tests


In Sections 4.2, 4.3 and 4.5 it is assumed that it is desired to
test hypotheses about the population moment conditions upon
which estimation is based. This mirrors the majority of empirical
applications mentioned in the introduction. In these types
of application, the model is only partially specified and so it is
desirable to base estimation on as much relevant information as
possible. Therefore all available moment conditions tend to be used
in estimation.³⁰ However, if the distribution of the data is known
then the most asymptotically efficient estimates are obtained by
using maximum likelihood. From a GMM perspective, maximum
likelihood amounts to estimation based on the score function of the
data. So, in this case there is no advantage to including any other
moment conditions implied by the model. These other moment
conditions are not redundant, however, because they can be used to
test whether the specification of the model is correct. In this section
we describe two approaches to constructing this type of test; these

29 Although, full information maximum likelihood may be more computation- 
ally burdensome; see Ghysels and Hall [1990c]. 

30 The choice of moment conditions may be limited by other factors such as
data availability or computational constraints.

are conditional moment tests and Hausman tests. We examine each
in turn. Our aim here is to contrast these tests with those described
earlier, and not to provide a comprehensive review.


4.7.1 Conditional Moment Tests 


Newey [1985b] and Tauchen [1985] independently introduce a general
framework for conditional moment testing based on maximum
likelihood estimators. To illustrate this framework, suppose that
the conditional probability density of $x_t$ given
$\{x_{t-1}, x_{t-2}, \ldots, x_1\}$ is $p_t(x_t;\theta_0)$, and so the score
function satisfies

$$E[L_t(\theta_0)] = 0,$$

where $L_t(\theta_0) = \partial\log p_t(x_t;\theta_0)/\partial\theta$. As mentioned
above this is the moment condition upon which estimation is based. Now
assume that if this model is correctly specified then the data also
satisfy the $(q \times 1)$ population moment condition
$E[g(x_t,\theta_0)] = 0$. Therefore one way to assess the validity of the
model is to test

$$H_0: E[g(x_t,\theta_0)] = 0$$

against the alternative

$$H_A: E[g(x_t,\theta_0)] \neq 0.$$

This hypothesis can be tested using the statistic

$$CM_T = T^{-1}\sum_{t=1}^{T} h_t(\hat\theta_T)'\;\hat Q_T^{-1}\sum_{t=1}^{T} h_t(\hat\theta_T), \tag{4-24}$$

where

$$h_t(\theta) = (g(x_t,\theta)',\, L_t(\theta)')',$$

$\hat Q_T$ is a consistent estimator of $\lim_{T\to\infty} Var[h_t(\theta_0)]$ and
$\hat\theta_T$ is the maximum likelihood estimator. Under $H_0$, $CM_T$
converges to a $\chi^2_q$ distribution. The statistic has a similar
structure to the overidentifying restrictions test but there is an
important difference. Since $E[g(x_t,\theta_0)] = 0$ is not used in
estimation, the statistic has power against any violations of $H_0$; see
Newey [1985b]. In spite of this, some caution is needed in the
interpretation of the results. While a rejection of $H_0$ implies the
model is misspecified, a failure to reject only implies that the assumed
distribution exhibits this particular
characteristic of the true distribution.

The choice of $g(.)$ varies from model to model. For example,
in the normal linear regression model, $g(.)$ often involves the third
and fourth moments of the error process; see Bowman and Shenton
[1975]. White [1982] suggests that one generally applicable choice
is to base $g(.)$ on the information matrix identity,

$$E[L_t(\theta_0)L_t(\theta_0)'] = -E[\partial L_t(\theta_0)/\partial\theta'],$$

because if the null hypothesis cannot be rejected then conventional
formulae for Wald, LR and LM statistics are valid. Consequently,
this approach has been explored in many settings; see, for
example, Chesher [1984] and Hall [1987]. Various other examples
are provided by Newey [1985b] and Tauchen [1985].
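
To make the construction concrete, the following is a minimal sketch of a conditional moment statistic of the form (4-24) for a deliberately simple case: independent observations assumed to be normal with unknown mean and variance, with $g(.)$ built from the third and fourth central moments. This setting, the function name and the use of the average outer product of $h_t$ as the variance estimator are all assumptions made for illustration; they are one possible choice and not necessarily the implementation used in the papers cited above.

```python
import numpy as np

def cm_test_normal(x):
    """Conditional moment statistic for an i.i.d. N(mu, sigma2) model,
    testing E[e^3] = 0 and E[e^4 - 3 sigma2^2] = 0 (hypothetical example)."""
    x = np.asarray(x, dtype=float)
    T = x.size
    mu = x.mean()                       # MLE of the mean
    e = x - mu
    sigma2 = (e ** 2).mean()            # MLE of the variance
    # Score contributions L_t; they sum to zero at the MLE.
    score = np.column_stack([e / sigma2,
                             (e ** 2 - sigma2) / (2.0 * sigma2 ** 2)])
    # Additional moment conditions g_t, not used in estimation.
    g = np.column_stack([e ** 3, e ** 4 - 3.0 * sigma2 ** 2])
    h = np.hstack([g, score])           # h_t = (g_t', L_t')'
    h_sum = h.sum(axis=0)
    Q = h.T @ h / T                     # outer-product estimate of Var[h_t]
    return float(h_sum @ np.linalg.solve(Q, h_sum) / T)

rng = np.random.default_rng(0)
cm_normal = cm_test_normal(rng.standard_normal(2000))   # model correct
cm_skewed = cm_test_normal(rng.exponential(size=2000))  # model misspecified
```

With $q = 2$ here, the statistic would be compared with a $\chi^2_2$ critical value. Because the score part of the summed $h_t$ is zero at the maximum likelihood estimator, the quadratic form automatically adjusts the variance of the $g(.)$ moments for parameter estimation.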


4.7.2 Hausman Tests 


We have already encountered the Hausman [1978] test principle in
testing hypotheses about a subset of moment conditions. However,
the basic idea is more widely applicable and we now briefly discuss
its use in the context of maximum likelihood estimation. As
we have seen, the test compares two estimates of the parameter
vector: one is consistent under the null hypothesis that the model
is correctly specified (this is the maximum likelihood estimator,
MLE, in our context) and the other is consistent under both
the null and the alternative. The exact nature of this alternative will
become apparent below. Let $\hat\theta_T$ once again denote the MLE and
$\tilde\theta_T$ be the GMM estimator of $\theta_0$ based on
$E[f(x_t,\theta_0)] = 0$. The Hausman test statistic is

$$H_T = T\,(\hat\theta_T - \tilde\theta_T)'\,(\tilde V_T - \hat V_T)^{-1}\,(\hat\theta_T - \tilde\theta_T),$$

where $\hat V_T$ and $\tilde V_T$ are consistent estimators of the asymptotic
covariance of $\hat\theta_T$ and $\tilde\theta_T$ respectively. Under the null
hypothesis that maximum likelihood is based on the correct model and
$E[f(x_t,\theta_0)] = 0$, Hausman [1978] shows that $H_T$ converges to a
$\chi^2_p$ distribution. The alternative hypothesis is that the model is
not correctly specified but nevertheless $E[f(x_t,\theta_0)] = 0$. Just as
in Section 4.3, this
statistic can be interpreted in terms of the identifying restrictions.
White [1982] shows that $H_T$ is asymptotically equivalent to a statistic
based on $T^{-1/2}\sum_{t=1}^{T} L_t(\tilde\theta_T)$. Since $L_t(.)$ is
$(p \times 1)$ it follows that $T^{-1/2}\sum_{t=1}^{T} L_t(\tilde\theta_T)$
contains exactly the same information as

$$P\,\hat V^{-1/2}\,T^{-1/2}\sum_{t=1}^{T} L_t(\tilde\theta_T), \tag{4-25}$$

where $P = M(M'M)^{-1}M'$, $M = \hat V^{-1/2}T^{-1}\partial L_T(\tilde\theta_T)/\partial\theta'$,
$L_T(\theta) = \sum_{t=1}^{T} L_t(\theta)$ and
$\hat V = T^{-1}\sum_{t=1}^{T} L_t(\tilde\theta_T)L_t(\tilde\theta_T)'$. A comparison of
(4-25) with (4-3) indicates that the Hausman statistic is using the
information in $E[f(x_t,\theta_0)] = 0$ to test the validity of the identifying
restrictions for maximum likelihood estimation.
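
The Hausman statistic of this section is a quadratic form in the difference between the two estimators, and a minimal sketch of the calculation is given below. The two estimates and their estimated asymptotic covariance matrices are taken as given inputs; the function and argument names and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

def hausman_statistic(theta_mle, theta_gmm, V_mle, V_gmm, T):
    """T * d' (V_gmm - V_mle)^{-1} d with d = theta_mle - theta_gmm.
    Hypothetical helper: V_gmm - V_mle is assumed to be positive definite,
    which need not hold for estimated matrices in finite samples."""
    d = np.asarray(theta_mle, dtype=float) - np.asarray(theta_gmm, dtype=float)
    V_diff = np.asarray(V_gmm, dtype=float) - np.asarray(V_mle, dtype=float)
    return float(T * d @ np.linalg.solve(V_diff, d))

# Toy inputs (assumed): a single parameter, T = 100 observations.
h_stat = hausman_statistic([1.0], [1.1], [[1.0]], [[2.0]], T=100)
```

The result would be compared with a chi-squared critical value with degrees of freedom equal to the number of parameters.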
Chapter 5


FINITE SAMPLE PROPERTIES OF GMM ESTIMATORS AND TESTS


Jan M. Podivinsky 


Although GMM estimators are consistent and asymptotically nor- 
mally distributed under general regularity conditions, it has long 
been recognized that this first-order asymptotic distribution may 
provide a poor approximation to the finite sample distribution. In 
particular, GMM estimators may be badly biased, and asymptotic 
tests based on these estimators may have true sizes substantially 
different from presumed nominal sizes. 


This chapter reviews these finite sample properties, from both 
the theoretical perspective, and from simulation evidence of Monte 
Carlo studies. The theoretical literature on the finite sample behav- 
ior of instrumental variables estimators and tests is seen to provide 
valuable insights into the finite sample behavior of GMM estimators 
and tests. 


The chapter then considers Monte Carlo simulation evidence 
of the finite sample performance of GMM techniques. Such studies 
have often focussed on applications of GMM to estimating partic- 
ular models in economics and finance, e.g., business cycle models, 
inventory models, asset pricing models, and stochastic volatility 
models. This survey reviews and summarizes the lessons from this 
simulation evidence. 


The final section examines how this knowledge of the finite 
sample behavior might be used to conduct improved inference. 
For example, bias corrected estimators may be obtained. Also,
properly implemented bootstrap techniques can deliver modified
critical values or improved test statistics with rather better finite
sample behavior. Alternatively, analytical techniques might be used
to obtain corrected test statistics.


5.1 Related Theoretical Literature 


It is well recognized that the GMM estimator can be considered 
as an instrumental variables (IV) estimator. For example, Hall’s
[1993] comprehensive introduction to GMM proceeds by first outlin- 
ing the (asymptotic) properties of the IV estimator in the (static) 
linear regression model, and then extending the analysis to the 
properties of GMM in dynamic nonlinear models. Since the con- 
ventional IV estimator is a special case of GMM, this suggests that 
the extensive literature (see Phillips [1983] for an early survey) on 
the theoretical properties of IV techniques may be a useful guide 
to understanding the properties of GMM. 


One aspect of this theoretical literature is a comparison of 
IV based methods (including Two Stage Least Squares (2SLS) 
and Generalized Instrumental Variables Estimator (GIVE)) and 
maximum likelihood based methods (Limited Information Maxi- 
mum Likelihood (LIML) and Full Information Maximum Likeli- 
hood (FIML)). In the context of single equation estimation, one 
disadvantage of 2SLS relative to LIML is that the 2SLS estimator is 
not invariant to normalisation. Hillier [1990] demonstrated that the 
2SLS estimator is identified only with respect to its direction, and 
not its magnitude, and then argued that this feature is responsible 
for distorting the finite sample behavior of 2SLS. We will see later 
how this argument might be used to suggest an alternative GMM 
estimator. 


Recently, there has been considerable interest in examining the
consequences of IV estimators when the instruments are “weak”,
i.e., poorly correlated with the endogenous regressors. Nelson and
Startz [1990b] evaluated the exact small sample distribution of the
IV estimator in a simple leading case of a linear regression framework.
Among other results, they find that the central tendency of
the IV estimator is biased away from the true value, and suggest
that the asymptotic distribution of IV is a rather poor approximation
to the exact distribution when an instrument is only weakly
correlated with the endogenous regressor, and when the number of
observations is small. They also find that the IV estimator may
be bimodal, but Maddala and Jeong [1992] emphasize that this is
primarily a consequence of Nelson and Startz’s [1990b] covariance
matrix design, and not entirely due to the weak instrument problem.
This evidence of the importance of the “quality” (i.e., degree
of correlation with endogenous regressors) of the instruments in
IV estimation suggests that similar considerations are likely to be
important with GMM estimation.
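
A small Monte Carlo experiment gives a feel for the weak instrument problem described above. The design below is an illustrative assumption and is not the design used by Nelson and Startz: a single regressor, a single instrument, a true coefficient of zero, and errors in the two equations with correlation 0.9. All function and parameter names are hypothetical.

```python
import numpy as np

def simulate_iv(pi_coef, n=100, reps=2000, beta=0.0, rho=0.9, seed=0):
    """Simulate the just-identified IV estimator in
    y = beta * x + u,  x = pi_coef * z + v,  corr(u, v) = rho.
    Returns one IV estimate per replication (assumed toy design)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((reps, n))
    v = rng.standard_normal((reps, n))
    # Structural error correlated with the first-stage error.
    u = rho * v + np.sqrt(1.0 - rho ** 2) * rng.standard_normal((reps, n))
    x = pi_coef * z + v
    y = beta * x + u
    # IV estimator: sum(z*y) / sum(z*x) for each replication.
    return (z * y).sum(axis=1) / (z * x).sum(axis=1)

strong = simulate_iv(pi_coef=1.0)    # instrument strongly related to x
weak = simulate_iv(pi_coef=0.01)     # instrument almost unrelated to x
median_strong, median_weak = np.median(strong), np.median(weak)
```

The median is used to summarise central tendency because the just-identified IV estimator has very heavy tails. In a design of this kind the median with the strong instrument should lie close to the true value of zero, whereas with the nearly irrelevant instrument it should be pulled towards the probability limit of least squares.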


Staiger and Stock [1997] undertake a rather more extensive 
analysis of this “weak” instruments problem. They develop an
asymptotic framework alternative to conventional asymptotics (both
first-order and higher-order) to approximate the distributions of 
statistics in single equation IV linear models. Staiger and Stock find 
that their results provide guidance as to which of several alternative 
tests have greatest power and least size distortions in the case of 
weak instruments. In addition, they provide several alternative 
methods for forming confidence intervals that are asymptotically 
valid with weak instruments. Although they confine themselves to 
considering only the linear IV framework, it would be constructive 
to extend their analysis to the GMM framework. Stock and Wright 
[1996], in a related contribution, develop a similar asymptotic dis- 
tribution theory framework applied to GMM, and suggest that it 
can help explain the non-normal finite sample distributions that 
they uncover. 


In a related study, Pagan and Jung [1993] explore the con- 
sequences for IV and GMM estimation of the weak instruments 
problem, by focussing on particular measures or indices (including 
but not restricted to the degree of correlation between instruments 
and regressors) that usefully explain the performance of these esti- 
mators. They use their suggested framework to analyse three par- 
ticular examples, and argue that their indices, considered jointly, 
are useful in highlighting problems of estimators in these examples. 


In a Monte Carlo study arising out of Nelson and Startz’s 
[1990b] analysis, Nelson and Startz [1990a] investigate the prop- 
erties of t-ratios and the chi-squared test for the overidentifying re- 
strictions in the context of linear instrumental variables regressions, 
under weak instrument conditions. When the instruments are only 
weakly correlated with the endogenous regressors, they find that 
the overidentification chi-squared test tends to reject the null too 
frequently compared with its asymptotic chi-squared distribution, 
and that coefficient t-ratios tend to be too large. Ogaki [1993] 
has argued that Nelson and Startz’s [1990a] results for t-ratios 
seem to be counterintuitive because an anticipated consequence 
of poor instruments would be a large equation standard error in 
estimation, and hence (ceteris paribus) a low t-ratio. Nevertheless 
(Ogaki argues), Nelson and Startz’s [1990a] results may be expected 
to carry over to tests based upon nonlinear IV estimation, rather 
closer to the framework of GMM estimation and testing. 


5.2 Simulation Evidence 


In recent years, there has been a small but growing literature on 
the small sample behavior of GMM estimators and tests, partly 
reversing the state of affairs recently summarized in Ogaki’s [1993, 
p. 480] statement: 


“Unfortunately there has not been much work done on the 
small sample properties of GMM estimators.” 


Here we consider this evidence (primarily simulation-based in 
nature) with respect to several main classes of models which have 
seen extensive empirical applications of GMM. 


We consider the following general notation. We denote the $q \times 1$ 
vector of population moment conditions: 


$$E[f(x_t, \theta_0)] = 0,$$ 


where $\theta_0$ denotes the (unknown) true value of the $p \times 1$ vector of 
parameters, where $q > p$. The $q \times 1$ vector of empirical moment 
conditions based upon a sample of $T$ observations on the data vector 
$x_t$ is: 


$$f_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} f(x_t, \theta).$$ 


The standard GMM approach is to minimize the criterion function: 


$$Q_T(\theta) = f_T(\theta)' W_T f_T(\theta)$$ 


(depending upon a $q \times q$ positive-definite weighting matrix $W_T$) by 
an appropriate choice of $\theta$, leading to the GMM estimator: 


$$\hat{\theta}_T = \arg\min_{\theta} Q_T(\theta).$$ 


Provided $q > p$, we can test the validity of $(q - p)$ overidentifying 
restrictions using the statistic: 


$$J = T f_T(\hat{\theta}_T^{*})' \hat{W}_T f_T(\hat{\theta}_T^{*}),$$ 


where $\hat{\theta}_T^{*}$ denotes the asymptotically efficient GMM estimator of $\theta$ 
(based upon an appropriate choice of weighting matrix $W_T$), and 
$\hat{W}_T$ denotes an estimate of the optimal weighting matrix. Under 
the null hypothesis that $E[f(x_t, \theta_0)] = 0$, the goodness of fit $J$ 
test statistic is asymptotically distributed as a chi-squared random 
variable with $(q - p)$ degrees of freedom, $\chi^2(q - p)$. 
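

The notation above can be made concrete with a minimal sketch. Everything specific in it is an assumption made for illustration: the data are i.i.d. exponential with mean theta, the q = 2 moment functions are (x - theta, x^2 - 2*theta^2), the weighting matrix is the inverse of the uncentred sample second-moment matrix of the moment functions, and the minimiser is a plain golden-section search. With p = 1 parameter there is one overidentifying restriction.

```python
import random


def moments(x, theta):
    """Per-observation moment functions f(x_t, theta), q = 2."""
    return [(xt - theta, xt * xt - 2.0 * theta * theta) for xt in x]


def criterion(x, theta, W):
    """Q_T(theta) = f_T(theta)' W f_T(theta) for a symmetric 2 x 2 matrix W."""
    T = len(x)
    f = moments(x, theta)
    f1 = sum(a for a, _ in f) / T
    f2 = sum(b for _, b in f) / T
    return W[0][0] * f1 * f1 + 2.0 * W[0][1] * f1 * f2 + W[1][1] * f2 * f2


def optimal_weight(x, theta):
    """Inverse of S_T(theta) = (1/T) sum f_t f_t' (valid for i.i.d. data)."""
    T = len(x)
    f = moments(x, theta)
    s11 = sum(a * a for a, _ in f) / T
    s12 = sum(a * b for a, b in f) / T
    s22 = sum(b * b for _, b in f) / T
    det = s11 * s22 - s12 * s12
    return [[s22 / det, -s12 / det], [-s12 / det, s11 / det]]


def minimise(func, lo, hi, tol=1e-8):
    """Golden-section search for a minimum of func on [lo, hi]."""
    g = (5 ** 0.5 - 1) / 2
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    while hi - lo > tol:
        if func(c) < func(d):
            hi, d = d, c
            c = hi - g * (hi - lo)
        else:
            lo, c = c, d
            d = lo + g * (hi - lo)
    return 0.5 * (lo + hi)


random.seed(0)
theta_true = 2.0
x = [random.expovariate(1.0 / theta_true) for _ in range(500)]

identity = [[1.0, 0.0], [0.0, 1.0]]
theta_1 = minimise(lambda th: criterion(x, th, identity), 0.1, 10.0)  # first step
W_hat = optimal_weight(x, theta_1)                                    # estimated optimal weights
theta_2 = minimise(lambda th: criterion(x, th, W_hat), 0.1, 10.0)     # second step
J = len(x) * criterion(x, theta_2, W_hat)   # compare with chi-squared, q - p = 1 d.f.
print(theta_1, theta_2, J)
```

In an application with dependent data the weighting matrix would instead be based on an estimate of the long-run covariance matrix of the moment conditions.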


5.2.1 Asset Pricing Models 


The earliest, and perhaps the most extensive, area of application 
of GMM techniques is that of asset pricing models, motivated by 
the original papers of Hansen [1982] and especially Hansen and 
Singleton [1982]. In the first published simulation based studies 
of the small sample behavior of GMM techniques, Tauchen [1986] 
and Kocherlakota [1990] examine applications to data generated 
from artificial nonlinear consumption based asset pricing models. 
Tauchen [1986] applied the so-called two-step GMM estimator. 
This starts with a sub-optimal choice (an identity matrix, perhaps) 
of the weighting matrix $W_T$ to provide an estimate (by minimizing 
the appropriate criterion function) of $\theta$, and thus an estimate of 
the optimal weighting matrix. These are then used in the second 
stage to provide (again by minimizing the criterion function) the 
asymptotically efficient estimator $\hat{\theta}_T^{*}$ of $\theta$. He concluded that the 
GMM estimators and test statistics he considered have reasonable 
small sample properties for data produced by simulations for a 
consumption based capital asset pricing model (C-CAPM). These 
findings related to a relatively small sample size (either 50 or 75 
observations). Tauchen [1986] also investigated the small sample 
properties of (asymptotically) optimal instrumental variables GMM 
estimators. He finds that the optimal estimators based upon the 
optimal selection of instruments often do not perform as well in 
small samples as GMM estimators using an arbitrary selection of 
instruments. 


Kocherlakota [1990] considers artificial data from a model very 
similar to that used by Tauchen [1986], but allows for multiple 
assets and different sets of instruments. However, instead of using 
the two stage GMM estimator his analysis uses the iterated GMM 
estimator, obtained by iterating between estimates of the parameter 
vector $\theta$ and the weighting matrix $W_T$ until convergence is attained. 
For a sample size of 90 observations, Kocherlakota finds that GMM 
performs worse with larger instrument sets, leading to downward 
biases in coefficient estimates, and confidence intervals that are too 
narrow. In addition, he finds evidence that the chi-squared J test 
for overidentifying restrictions tends to reject the null too frequently 
compared with its asymptotic size. The apparent discrepancy be- 
tween some of the findings of Kocherlakota [1990] and Tauchen 
[1986] may be related to the problem of poor “quality” instruments 
discussed previously in the context of IV estimation. Based upon 
this evidence, it seems sensible to recommend using a relatively 
small number of instruments rather than the often large number 
of instruments arising out of the use of an arbitrary selection of 
instruments. 
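

The iteration between parameter estimates and weighting matrix described above can be sketched in a deliberately simple setting. This is not Kocherlakota's asset pricing design. As an assumption for illustration, two series x and z are noisy measurements of the same mean theta, so the moment functions (x_t - theta, z_t - theta) are linear in theta and the minimiser of the criterion for a given weighting matrix has a closed form. All function names are hypothetical.

```python
import random


def weight_matrix(x, z, theta):
    """Inverse of S_T(theta) = (1/T) sum f_t f_t', f_t = (x_t - theta, z_t - theta)."""
    T = len(x)
    s11 = sum((a - theta) ** 2 for a in x) / T
    s22 = sum((b - theta) ** 2 for b in z) / T
    s12 = sum((a - theta) * (b - theta) for a, b in zip(x, z)) / T
    det = s11 * s22 - s12 * s12
    return [[s22 / det, -s12 / det], [-s12 / det, s11 / det]]


def gmm_given_weight(xbar, zbar, W):
    """Closed-form minimiser of f_T(theta)' W f_T(theta) for these linear moments."""
    num = (W[0][0] + W[0][1]) * xbar + (W[0][1] + W[1][1]) * zbar
    den = W[0][0] + 2.0 * W[0][1] + W[1][1]
    return num / den


def iterated_gmm(x, z, tol=1e-10, max_iter=100):
    """Alternate between theta and W until theta stops changing."""
    T = len(x)
    xbar, zbar = sum(x) / T, sum(z) / T
    theta = gmm_given_weight(xbar, zbar, [[1.0, 0.0], [0.0, 1.0]])  # identity start
    for n_iter in range(1, max_iter + 1):
        W = weight_matrix(x, z, theta)
        theta_new = gmm_given_weight(xbar, zbar, W)
        if abs(theta_new - theta) < tol:
            return theta_new, n_iter
        theta = theta_new
    return theta, max_iter


random.seed(1)
theta_true = 1.0
x = [random.gauss(theta_true, 1.0) for _ in range(200)]   # precise measurement
z = [random.gauss(theta_true, 2.0) for _ in range(200)]   # noisy measurement
theta_hat, n_iter = iterated_gmm(x, z)
print(theta_hat, n_iter)
```

Stopping after the first pass through the loop would give the two stage estimator; the iterated estimator simply continues until successive estimates agree to within the tolerance.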


Ferson and Foerster [1994] examine models for differing num- 
bers of asset returns using financial data (of between 60 and 720 
observations), and subject to nonlinear, cross-equation restrictions. 
They study both the two stage and iterated GMM estimators. They 
find that the two stage GMM approach produces goodness of fit test 
statistics that over-reject the restrictions. In contrast, the iterated 
GMM estimator has superior finite sample properties, producing 
coefficient estimates that are approximately unbiased in simple 
models. However, the estimated coefficients’ asymptotic standard 
errors are underestimated. This can be at least partially corrected 
by using simple scalar degrees-of-freedom adjustment factors for 
the estimated standard errors. In more complex models (e.g., as 
the number of assets increases), however, they find that both the 
estimated coefficients and their standard errors can be highly unre- 
liable. Tests of a single premium model have higher power against 
a two premium, fixed beta alternative than against a C-CAPM 
alternative with time varying betas. Low power in the direction 
of some alternatives is seen as an inevitable consequence of the 
generality of the GMM approach to inference. Finally, the previously 
documented over-rejection of the J test of overidentifying 
restrictions is confirmed when using the two stage GMM estimator, 
but when using the iterated GMM estimator the J test tends to 
under-reject marginally relative to its nominal size. 


Hansen, Heaton and Yaron [1996] consider (inter alia) both 
the two stage and iterated GMM estimators in a framework similar 
to that used by Tauchen [1986] and Kocherlakota [1990], together 
with alternative choices of the weighting matrix $W_T$. They find that 
both the two stage and iterated GMM estimators have small sample 
distributions that can be greatly distorted (even for a sample as 
large as 400 observations), resulting in over-rejections of the J test 
of overidentifying restrictions, and unreliable hypothesis testing and 
confidence-interval coverage. Based upon this evidence, they rec- 
ommend an alternative form of GMM estimator, to be considered 
later. 


5.2.2 Business Cycle Models 


Christiano and den Haan [1996] conduct a Monte Carlo simulation 
investigation of the finite sample properties of GMM procedures 
for conducting inference about statistics that are of interest in 
the business cycle literature. Such statistics (based upon GMM 
estimators) include the second moments of data filtered using either 
the first difference or Hodrick-Prescott filters, and include statistics 
for evaluating model fit. Both types of statistics are commonly used 
in business cycle studies. In their investigation they take care to 
use what they consider as empirically plausible data generating 
processes (DGPs) for their data. Examples include a four-variable 
vector autoregression (VAR) in (the logarithms of) consumption, 
employment, output and investment. Their results indicate that in 
a sample the size (T = 120) of quarterly postwar U.S. data, the 
existing asymptotic distribution theory is not a reliable guide to 
the behavior of the procedures considered. 


In an assessment of the small sample properties of GMM based 
Wald statistics, Burnside and Eichenbaum [1996] consider as alter- 
native DGPs both a vector white noise process, and an equilibrium 
business cycle model. With T = 100 observations, they find that 
in many cases the small sample size of the Wald test exceeds its asymptotic 
size and increases sharply with the number of hypotheses 
being tested jointly. Burnside and Eichenbaum [1996] argue that 
this is mostly due to difficulty in estimating the spectral density 
matrix of the residuals, and conclude that they are very skeptical 
that the problem can be resolved by using any of the alternative 
nonparametric estimators of this matrix that have recently been 
discussed in the literature (see Chapter 3). This evidence of the 
importance of correctly estimating the spectral density matrix of 
the residuals mirrors similar findings by Christiano and den Haan 
[1996]. However, the properties of Wald statistics can be improved, 
often substantially, by using estimators of this spectral density ma- 
trix that impose prior information embodied in restrictions implied 
by either the model or the null hypothesis. 


Finally, Gregory and Smith [1996] present two applications to 
US post-war data of business cycles that may be defined or mea- 
sured by parametrizing detrending filters to maximize the ability 
of a business cycle model to match the moments of the remaining 
cycles. In this way a business cycle theory can be used to guide 
business cycle measurement. In one application, these cycles are 
measured with a standard real business cycle model, and in the 
second, they are measured using information on capacity utilization 
and unemployment rates. Gregory and Smith [1996] then use sim- 
ulation methods to describe the properties of the GMM estimators 
and to conduct exact inference. 


5.2.3 Stochastic Volatility Models 


Changes over time in the variance or volatility of some financial 
asset return can be modelled using the class of stochastic volatility 
(SV) models. This approach is based on treating the volatility $\sigma_t$ 
of returns $y_t$ as an unobserved variable, for instance: 


$$y_t = \sigma_t u_t, \qquad u_t \sim \text{i.i.d. } N(0, 1),$$ 


the logarithm of which is modelled as a linear stochastic process, 
usually an autoregression, e.g., 


$$\log \sigma_t^2 = \beta_0 + \beta_1 \log \sigma_{t-1}^2 + w_t,$$ 


where $w_t \sim \text{i.i.d. } N(0, \sigma_w^2)$ independent of $u_t$. Note that this lognormal 
parameterisation is commonly employed because it is consistent 
with evidence of excess kurtosis or “heavy tails” in the unconditional 
distribution of returns. 
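

The excess kurtosis implied by this specification can be checked by simulation. The sketch below assumes the log-variance AR(1) form given above; the parameter values and function names are illustrative assumptions, not taken from any of the studies cited. For Gaussian errors the unconditional kurtosis of returns is 3 exp(s), where s is the unconditional variance of the log-variance process, so it exceeds 3 whenever the volatility shocks have positive variance.

```python
import math
import random


def simulate_sv(T, beta0=-0.7, beta1=0.9, sigma_w=0.35, seed=0):
    """Simulate returns y_t = sigma_t * u_t with an AR(1) for log sigma_t^2."""
    rng = random.Random(seed)
    h = beta0 / (1.0 - beta1)      # start log-variance at its unconditional mean
    y = []
    for _ in range(T):
        h = beta0 + beta1 * h + sigma_w * rng.gauss(0.0, 1.0)   # log sigma_t^2
        y.append(math.exp(0.5 * h) * rng.gauss(0.0, 1.0))       # y_t = sigma_t u_t
    return y


def sample_kurtosis(y):
    """Fourth central moment divided by the squared second central moment."""
    T = len(y)
    mean = sum(y) / T
    m2 = sum((v - mean) ** 2 for v in y) / T
    m4 = sum((v - mean) ** 4 for v in y) / T
    return m4 / (m2 * m2)


y = simulate_sv(20000)
kurt = sample_kurtosis(y)
# Kurtosis implied by the model: 3 * exp(sigma_w^2 / (1 - beta1^2)).
implied = 3.0 * math.exp(0.35 ** 2 / (1.0 - 0.9 ** 2))
print(kurt, implied)
```

Sample moments of this kind (absolute returns, squared returns and their autocovariances) are the natural ingredients of the GMM moment conditions used for SV models.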


Ruiz [1994] analysed the asymptotic and finite sample proper- 
ties of a Quasi-Maximum Likelihood (QML) estimator (based on 
the Kalman filter) of such an SV model, and applied this to an 
exchange rate model. Although based upon normality, the QML 
estimator can still be employed when the SV model is generalized 
to allow for distributions with heavier tails than the normal. Ini- 
tial evidence by Ruiz [1994] appeared to suggest that this QML 
estimator displayed considerable advantages relative to the GMM 
estimator. In particular, the relative efficiency of the QML estima- 
tor when compared with estimators based on the GMM appeared 
to be quite high for parameter values often found in empirical 
applications. However, Andersen and Sørensen [1997] (see also 
Ruiz [1997]) point out a flaw in Ruiz’s [1994] calculations of the 
asymptotic standard errors of the GMM estimator. Andersen and 
Sørensen [1997] describe a practical procedure for arbitrarily precise 
calculation of the GMM asymptotic standard errors, and apply this 
to correct Ruiz’s [1994] calculations. In some cases they find the 
correct results are some orders of magnitude different from Ruiz’s 
previously published figures; this places the relative efficiencies of 
GMM and QML estimators on a more even footing. 


Jacquier, Polson and Rossi [1994] develop alternative Bayesian 
techniques for the analysis of stochastic volatility models in which 
the logarithm of conditional variance follows an autoregressive 
model. They use a cyclic Metropolis algorithm to construct a 
Markov chain simulation tool. Simulations from this Markov chain 
can be shown to converge in distribution to draws from the posterior 
distribution, thus enabling exact finite sample inference. One con- 
venient byproduct of their Markov chain simulation method is the 
exact solution to the filtering/smoothing problem of inference about 
the unobserved variance states. In addition, their method means 
that multistep-ahead predictive densities can be constructed that 
reflect both inherent model variability and parameter uncertainty. 
After applying their method to daily and weekly data 
on stock returns and exchange rates, Jacquier, Polson and Rossi 
[1994] conduct sampling experiments to compare the performance 
of their Bayes estimators to both GMM and QML estimators. 
They conclude that for both parameter estimation and filtering, 
the Bayes estimators outperform both alternative approaches, and 
GMM weakly dominates QML. 


Andersen and Sørensen [1996] examine alternative GMM pro- 
cedures for estimation of a stochastic autoregressive volatility model 
by Monte Carlo simulation methods. To maintain comparability 
with Jacquier, Polson and Rossi’s [1994] evidence, they use an 
experimental design similar to that used by Jacquier, Polson and 
Rossi [1994], but extended by allowing larger sample sizes (T = 
500, 1000, 2000, 4000 and 10000) more representative of those often 
used in applications of GMM to high-frequency financial data. 
Andersen and Sørensen [1996] find a clear trade-off between the 
number of moments, or the amount of information, included in 
estimation and the quality, or precision, of the objective function 
used for estimation. Generally, they find it is not optimal to 
use many moments in estimation if the sample size is relatively 
limited. Also, their results suggest that using less volatile and 
lower-order moments dominates using more volatile and higher-order 
moments. Furthermore, Andersen and Sørensen use an approximation 
to the optimal weighting matrix to explore the impact 
of the weighting matrix $W_T$ for estimation, specification testing and 
inference procedures. Provided these guidelines concerning choice 
of moments relative to sample size are followed, the chi-squared 
J test for overidentifying restrictions is reasonably well behaved. 
While Andersen and Sørensen indicate many further finite sample 
issues that deserve attention, they argue that their results provide 
guidelines that help achieve desirable small sample properties in a 
wide range of economic settings characterized by strong conditional 
heteroskedasticity and correlation among the moments. 


Dahlquist [1996] includes a simulation analysis of the small 
sample properties of the GMM estimators in his application of al- 
ternative interest rate processes estimated for Denmark, Germany, 
Sweden, and the UK, using the GMM estimator. In line with 
previous evidence from the study by Chan, Karolyi, Longstaff, and 
Sanders [1992] on US data, there seems to be a positive relation be- 
tween interest rate level and volatility for some countries. However, 
Dahlquist finds that mean-reversion plays an important role for 
the specification of the interest rate dynamics. He finds that these 
results seem to be robust to the use of different moment conditions, 
and simulations of the estimated models reveal that they are also 
reasonably successful at capturing non-fitted moments as well. 


Flesaker [1993] provides empirical tests of two commonly used 
models for pricing fixed-income derivative securities, using the con- 
stant volatility version of the Heath, Jarrow, and Morton model, 
which is also the continuous time limit of the Ho and Lee model. 
He finds that this model can be rejected using a GMM test for most 
subperiods of three years of daily data for Eurodollar futures and 
futures options. He also conducts a limited simulation analysis of 
the small sample properties and power of the GMM framework. 


5.2.4 Inventory Models 


Fuhrer, Moore and Schuh [1995] compare GMM and maximum 
likelihood (ML) estimators of the parameters of a linear-quadratic 
inventory model using both US nondurable manufacturing data and 
Monte Carlo simulations. GMM estimates for five normalizations 
vary widely, but generally reject the model, while the ML estimates 
differ, generally supporting the model. Their Monte Carlo simula- 
tion experiments reveal a possible explanation for this difference. 
They find that the GMM estimates are often biased, statistically 
insignificant, economically implausible, and dynamically unstable, 
while the ML estimates are generally unbiased (even in misspec- 
ified models), statistically significant, economically plausible, and 
dynamically stable. They attribute the poor performance of GMM 
to the weak instrument problem. However, in findings reminiscent 
of those of Ruiz [1994], they find that asymptotic standard errors 
for ML are 3 to 15 times smaller than for GMM, which suggests 
that Andersen and Sørensen’s [1997] method for arbitrarily precise 
calculation of GMM asymptotic standard errors might usefully be 
employed. 


West and Wilcox [1996] consider IV (and GMM) estimation of 
a scalar dynamic linear equation with a conditionally homoskedas- 
tic moving average disturbance. Such equations arise regularly 
in empirical work on inventory data (see West [1996], and West 
and Wilcox [1994]). They compare two parameterizations of a 
commonly used IV estimator to one that is asymptotically optimal 
in a class of estimators that includes the conventional one. Based 
upon some plausible DGPs (including that used in West and Wilcox 
[1994]), the optimal estimator is found to be more efficient asymp- 
totically, and they recommend this estimator. Their simulations 
indicate that in samples of size typically available (T = 100 or 
300), asymptotic theory describes the distribution of the parameter 
estimates reasonably well but that test statistics occasionally are 
poorly sized. 


5.2.5 Models of Covariance Structures 


Models of covariance structures cover a variety of different areas of 
empirical applications. These include studies of contract models, 
consumption, and the effects of individual endowments on earnings 
and education. In such models of covariance structures, when GMM 
minimizes the weighted distance between a vector m of sample 
second moments and the vector $f(\theta)$ of population second moments, 
it is commonly known as the optimal minimum distance (OMD) 
estimator: 


$$\hat{\theta}_{OMD} = \arg\min_{\theta} \, (m - f(\theta))' \hat{\Omega}^{-1} (m - f(\theta)),$$ 


where $\hat{\Omega}$ denotes the estimated covariance matrix of $m$. 


Altonji and Segal [1996] examine the small sample properties 
of the OMD (i.e., GMM) estimator applied to models of covariance 
structures, and find that the OMD estimator $\hat{\theta}_{OMD}$ is almost always 
biased downwards in absolute value. This bias arises because 
sampling errors in the second moments are correlated with sam- 
pling errors in the weighting matrix used by OMD. Furthermore, 
they find that OMD is usually dominated by the equally weighted 
minimum distance (EWMD) estimator: 


$$\hat{\theta}_{EWMD} = \arg\min_{\theta} \, (m - f(\theta))' (m - f(\theta)).$$ 


Altonji and Segal [1996] also propose an alternative estimator that 
is unbiased and asymptotically equivalent to OMD, but their Monte 
Carlo simulation evidence indicates that it, too, is usually 
dominated by EWMD. 
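

The mechanism behind the OMD bias can be reproduced in a stylised experiment. The design below is an assumption made for illustration and is not Altonji and Segal's: several independent groups share a common variance theta, the vector m collects the groups' sample second central moments, f(theta) is a vector with every element equal to theta, and the weighting matrix is diagonal with each element estimated from the same group's data. In that case OMD is an inverse-variance weighted average of the elements of m and EWMD is their simple average.

```python
import random


def second_moment_and_variance(data):
    """Sample second central moment and an estimate of its sampling variance."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((v - mean) ** 2 for v in data) / n
    m4 = sum((v - mean) ** 4 for v in data) / n
    return m2, (m4 - m2 * m2) / n


def one_replication(rng, groups=10, n=50, theta=1.0):
    """Return (OMD, EWMD) estimates of the common variance theta."""
    stats = []
    for _ in range(groups):
        data = [rng.gauss(0.0, theta ** 0.5) for _ in range(n)]
        stats.append(second_moment_and_variance(data))
    ewmd = sum(m for m, _ in stats) / groups
    # Large sample moments tend to come with large estimated variances,
    # so inverse-variance weights down-weight them systematically.
    omd = sum(m / v for m, v in stats) / sum(1.0 / v for _, v in stats)
    return omd, ewmd


rng = random.Random(0)
reps = 500
results = [one_replication(rng) for _ in range(reps)]
mean_omd = sum(o for o, _ in results) / reps
mean_ewmd = sum(e for _, e in results) / reps
print(mean_omd, mean_ewmd)   # true value is 1.0
```

The correlation between each sample moment and its own estimated variance is what pulls the optimally weighted estimate below the equally weighted one.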


Clark’s [1996] study examines the small sample properties of 
GMM and maximum likelihood (ML) estimators of nonlinear mod- 
els of covariance structure, thus extending the work of Altonji and 
Segal [1996]. Clark considers the properties of GMM and ML 
estimates for both a single factor model (the Hall and Mishkin 
[1982] model of consumption and income), and a simple structural 
VAR-type error model. His simulation analysis based upon either 
100 or 200 observations shows that optimally weighted GMM estimation 
can yield biased parameter estimates. In addition, a model 
specification test based on GMM estimation exhibits size substan- 
tially greater than the asymptotic size. Finally, these problems are 
reduced when the number of overidentifying restrictions in a model 
is reduced. 


5.2.6 Other Applications of GMM 


Finally, we consider a range of other applications of GMM which 
include some evidence of small sample behavior of GMM estimators 
and tests. We omit any consideration of various studies of small 
sample properties of GMM statistics in panel data models, as these 
are covered separately in Chapter 8. 


Ni [1997] investigates the consequences for GMM estimation of 
the scaling factors that are often used to restore stationarity in Euler 
equation residuals when data exhibit exponential trends. He demonstrates 
that finite sample estimates are sensitive to the scaling 
factors, and even seemingly plausible scaling factors may produce 
quite spurious estimates. One suggestion might be to choose scal- 
ing factors such that the scaled marginal utility remains roughly 
constant. Ni illustrates this by estimating a representative agent’s 
time nonseparable utility function, using both artificial data, and 
aggregate consumption and asset returns. 


Lee and Lee [1997] applied GMM estimation to the truncated 
or censored regression model. In a typical limited dependent vari- 
able regression, the conditional mean of the error distribution is 
no longer zero, so this cannot be used as a moment condition in 
GMM estimation. However, by appropriate symmetric trimming of 
the error distribution, we can obtain a zero conditional mean, and 
hence obtain the moment or orthogonality conditions required for 
GMM estimation. In an application of this symmetrically trimmed 
GMM estimator to a model of recreational demand, the bootstrap 
is used to simulate the finite sample distribution of this estimator. 


The consistency of the GLS estimator in static models with 
pooled cross section time series data is often tested by a Hausman 
test. Ahn and Low [1996] reformulate the Hausman test based on 
a GMM approach and find that it incorporates and tests only a 
limited set of moment restrictions. They also consider an alternative 
GMM statistic incorporating additional restrictions, which 
has power toward additional sources of model misspecification. 
Their Monte Carlo simulations demonstrate that while both the 
Hausman test and their alternative test have good power detecting 
endogenous regressors, the alternative dominates the Hausman test 
if coefficients of regressors are nonstationary. 


Non-parametric tests for duration dependence are considered 
by Mudambi and Taylor [1995]. Using finite sample critical values 
obtained by Monte Carlo methods, their results applied to UK 
business cycle data for 1854-1992 are remarkably consistent, for 
tests based on the method of moments, GMM, and a statistic whose 
null distribution probability limit is zero. However, they find that 
the null distribution of the GMM test statistic for samples of the 
size considered is distinctly non-normal, so that asymptotic critical 
values give erroneous results. 


5.3 Extensions of Standard GMM 


Since the evidence summarized in the previous section indicates 
that often there can be problems with the behavior of standard 
GMM statistics in small samples, this has motivated researchers 
to develop and investigate various refinements or extensions to 
standard GMM techniques. This section examines some of the 
most promising extensions, initially discussing estimation, and then 
testing. 


5.3.1 Estimation 


Hansen, Heaton and Yaron’s [1996] analysis of two stage and it- 
erated GMM estimators applied to asset pricing models was con- 
sidered in the previous section. Because they (and others) find 
that these two standard GMM estimators often exhibit poor small 
sample properties, they introduce and examine an alternative GMM 
estimator, namely the continuous updating GMM estimator. 


Both the two stage and iterated GMM estimators take the 
weighting matrix $W_T$ as fixed in each stage of the estimation procedure 
(i.e., when minimizing the GMM criterion function). The 
continuous updating GMM estimator changes this weighting matrix 
as $\theta$ is changed within the minimization. Thus the continuous 
updating GMM estimator is defined as: 


$$\hat{\theta}_T^{*} = \arg\min_{\theta} f_T(\theta)' W_T(\theta) f_T(\theta),$$ 


where the weighting matrix $W_T(\theta)$ now varies with $\theta$. The 
first-order conditions for this minimization problem with respect 
to $\theta$ therefore have an additional term arising from the weighting matrix 
$W_T(\theta)$, which adds to the complexity of the numerical optimisation. 
However, this does not alter the limiting distribution of the 
estimator. Hansen, Heaton and Yaron [1996] refer to Pakes and 
Pollard [1989] for a formal justification of the asymptotic properties 
of this continuous updating GMM estimator, and point out that it 
has the added advantage of being invariant to scaling of the moment 
conditions (see the related discussion of Ni [1997] in the previous 
section). 
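

The additional term in the first-order conditions can be written out explicitly. The following is a sketch under an assumption not spelled out in the text, namely that the weighting matrix is the inverse of a symmetric, differentiable estimate $S_T(\theta)$ of the covariance matrix of the moment conditions, so that $W_T(\theta) = S_T(\theta)^{-1}$. Differentiating the criterion with respect to the $j$-th element of $\theta$ gives:

```latex
\frac{\partial}{\partial\theta_j}
  \Big[ f_T(\theta)' W_T(\theta) f_T(\theta) \Big]
  = 2\,\frac{\partial f_T(\theta)'}{\partial\theta_j}\, W_T(\theta)\, f_T(\theta)
  + f_T(\theta)'\, \frac{\partial W_T(\theta)}{\partial\theta_j}\, f_T(\theta),
\qquad
\frac{\partial W_T(\theta)}{\partial\theta_j}
  = -\, W_T(\theta)\, \frac{\partial S_T(\theta)}{\partial\theta_j}\, W_T(\theta).
```

The first term on the right is the one that appears when the weighting matrix is held fixed, as in the two stage and iterated estimators; the second term is present only under continuous updating.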


An added advantage of this continuous updating GMM estima- 
tor is that it matches the recommendations of Magdalinos [1994] 
with respect to dominance of methods that change weighting matri- 
ces to embody restrictions. This is because the continuous updating 
GMM estimator is analogous to the LIML estimator, whereas the 
two stage and iterated estimators are closer to 2SLS: as we pointed 
out earlier, LIML has various advantages over 2SLS. 


Hansen, Heaton and Yaron [1996] investigate the small sample 
properties of this continuous updating GMM estimator, and find 
that it typically dominates the two stage and iterated GMM esti- 
mators (at least within the asset pricing model framework they 
consider). However, they caution that large sample inferences 
occasionally are unreliable. This is related to the finding that 
the continuous updating GMM estimator is often approximately 
median unbiased (like the LIML estimator), but sometimes exhibits 
extreme outlier behavior. They conclude that this estimator should 
be useful in many applications of GMM where standard GMM 
methods have been found to be deficient. 


Since the small sample performance of optimally weighted GMM 
estimators has been found to be poor in some applications (despite 
their desirable asymptotic properties), Kitamura and Stutzer [1997] 
propose a computationally simple alternative, for weakly dependent 
DGPs, based on minimizing the Kullback-Leibler Information 
Criterion. They derive conditions under which the asymptotic 
properties of this estimator are similar to GMM (i.e., consistency, 
asymptotic normality, and the same asymptotic covariance matrix 
as GMM). They also suggest tests of overidentifying and paramet- 
ric restrictions as alternatives to analogous GMM test procedures. 
Although they report no small sample evidence, they identify this 
as an important topic for further research. 


Imbens [1997] proposes other alternatives to standard GMM 
estimators, and shows that they have several advantages. An ini- 
tial estimate of the weighting matrix is no longer required, and it 
becomes straightforward to derive the distribution of the estimator 
under general misspecification. Some of the alternative estimators 
he proposes turn out to have appealing information-theoretic interpretations, 
and have distributions that can be approximated by 
saddlepoint approximations. The principal disadvantage is that of 
computation: the system of equations that has to be solved is of 
greater dimension than the number of parameters of interest. 


In a related paper, Smith [1997] examines alternative semi- 
parametric quasi-likelihood approaches to GMM. These embed 
sample versions of the moment conditions used in GMM into a 
non-parametric quasi-likelihood function by use of additional parameters 
associated with these moment conditions. It is possible 
to define both specification and misspecification tests which are 
similar to classical tests and are first-order equivalent to the corre- 
sponding GMM test statistics. He explores the structure of this 
semi-parametric quasi-maximum likelihood estimator for models 
estimated by instrumental variables. 


Finally, it is worth mentioning Kitamura and Phillips’ [1995] 
development of a limit theory for instrumental variables (IV) es- 
timation that allows for possibly nonstationary processes. They 
derive fully modified (FM) versions of several estimators, includ- 
ing GMM and the generalized instrumental variables estimator 
(GIVE). They investigate by simulation methods the small sample 
properties of the following estimation procedures: ordinary least 
squares, crude (conventional) IV, FM-IV, FM-GMM, and FM-GIVE. 
They find that among FM-IV, FM-GMM, and FM-GIVE, 
when applied to the stationary coefficients, FM-GIVE generally 
outperforms FM-IV and FM-GMM by a wide margin, whereas the 
difference between FM-IV and FM-GMM is quite small whenever 
the autoregressive roots of the stationary processes are quite large. 
However, when applied to the nonstationary coefficients, the three 
estimators are numerically very close. 


5.3.2 Testing 


Smith [1992] proposed non-nested tests for competing models estimated 
by GMM, based upon the Cox and encompassing principles. 
His results apply to non-nested linear regression models 
with heteroskedasticity and serial correlation of unknown form and 
differing instrument validity assumptions. Arkonac and Higgins 
[1995] examine the finite sample properties of these tests using 
Monte Carlo simulations. The size and power of the tests are 
found to depend upon the degree of heteroskedasticity, the degree 
of correlation between regressors and instrumental variables, the 
error distribution, the sample size and the number of regressors. 


Since various simulation experiments suggest that tests based 
on GMM estimators and their asymptotic critical values often have 
true sizes that differ greatly from their nominal sizes, Hall and 
Horowitz [1996] investigate conditions under which the bootstrap 
provides asymptotic refinements to the critical values of t-tests 
and tests of overidentifying restrictions, particularly in the case of
dependent data. Their simulation results show that the bootstrap 
usually reduces the errors in size that occur when critical values 
based on first-order asymptotic theory are used. The bootstrap also 
provides an indication of the accuracy of critical values obtained 
from first-order asymptotic theory. 


Zhou [1994] proposed alternative GMM tests that are analyt- 
ically solvable in many econometric models. In particular, this 
provides analytical GMM tests for asset pricing models with time- 
varying risk premiums. He provides simulation evidence showing 
that the proposed tests have good finite sample properties and 
that their asymptotic distribution is reliable for the sample size 
commonly used.


5.4 Concluding Comments


The literature surveyed in this chapter demonstrates two key fea- 
tures. First, there is considerable evidence that the finite sample 
properties of GMM estimators and tests are often not well approximated
by conventional asymptotics. This conclusion holds over the
wide range of different models, and areas of economics and finance, 
that have seen application of GMM techniques. Second, there is a 
lively and important current literature suggesting and investigating 
extended GMM procedures for estimation and testing. Many of 
these new developments have yet to be fully evaluated, either in 
practice or with respect to their finite sample behavior. We can 
expect to see important contributions to this topic of research.


Chapter 6


GMM ESTIMATION OF TIME SERIES MODELS


David Harris 


In time series analysis, the basic univariate model is the autore- 
gressive moving average (ARMA) one. The estimation of ARMA 
models has been the subject of a vast literature over many years. 
If a pure autoregressive (AR) model is considered then ordinary 
least squares (OLS) estimation is appropriate and is asymptotically 
equivalent to maximum likelihood when the errors are normally dis- 
tributed. However, the introduction of moving average (MA) com- 
ponents to the model complicates the estimation problem because 
the least squares criterion is no longer linear in the parameters. 
Both least squares and maximum likelihood estimation for models
involving MA terms require numerical optimisation and are relatively
computationally difficult. As a result, a variety of techniques
for the estimation of models with MA terms have been suggested 
that do not involve numerical optimisation. These techniques have 
generally made use (implicitly or explicitly) of moment conditions 
implied by the ARMA model, and therefore fall within the class of 
GMM estimators. This chapter has two aims. The first is to provide 
an introduction to some of these moments-based estimators. The
second is a pedagogic one to illustrate the general theory of GMM 
presented in Chapter 1 as applied to a relatively simple time series 
model. 


An outline of the chapter is as follows. In Section 6.1 we discuss 
the estimation of pure MA models. For simplicity we focus mostly 
on first order MA models, and indicate how extensions to higher
order models follow. In Section 6.2, we consider the estimation
of ARMA models, with particular emphasis on the estimation of
AR coefficients by instrumental variables methods. The result is
a computationally simple moments-based estimation procedure for
ARMA models. In Section 6.3 we investigate how these methods 
can be applied to testing for unit roots. 


6.1 Estimation of Moving Average Models 


In this section we discuss the estimation of pure moving average 
models of the form 


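$$y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \ldots + \theta_q \varepsilon_{t-q},$$


where $\varepsilon_t$ is an i.i.d. zero mean error process with variance $\sigma^2$ and
fourth order cumulant $\kappa_4$. With the extra assumption that $\varepsilon_t$ is
normally distributed, maximum likelihood estimation of $\theta_1,\ldots,\theta_q$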
is possible, but requires numerical maximization of the likelihood. 
For this reason, there has been considerable interest in finding sim- 
pler estimators that have properties approaching those of maximum 
likelihood, and some of these simpler estimators can be put in the 
GMM framework. 


In the following section, we give a simple estimator derived 
directly from the moments implied by an MA(1) model, but we 
show that it generally has poor properties. We then describe 
a popular approach to the estimation of moving average models 
which makes use of approximate autoregressive models that can be 
estimated by OLS regression. The properties of these estimators 
can be very good. Finally, we indicate how the methods can be 
extended to moving average models of general order. 


6.1.1 A Simple Estimator of an MA(1) Model 


We consider the estimation of $\theta_0$ in the model

$$y_t = \varepsilon_t + \theta_0 \varepsilon_{t-1}, \tag{6-1}$$

where we assume that $\varepsilon_t \sim \text{i.i.d.}(0, \sigma_0^2)$ and $|\theta_0| < 1$. It is
straightforward to show that the first order autocorrelation implied by this
model is

$$\rho_0 = \frac{E(y_t y_{t-1})}{E(y_t^2)} = \frac{\theta_0}{1+\theta_0^2}. \tag{6-2}$$

It is also the case that all higher order autocorrelations are zero,
but these additional moment conditions are not used here. If we
define the sample first order autocorrelation

$$\hat\rho_T = \frac{\sum_{t=2}^{T} y_t y_{t-1}}{\sum_{t=1}^{T} y_t^2},$$

then the replacement of unknown parameters by sample estimators
in Equation (6-2) suggests we solve the quadratic

$$\hat\theta_T^2 - \hat\rho_T^{-1}\hat\theta_T + 1 = 0$$


to obtain the estimator $\hat\theta_T$. In order for the solution of this quadratic
to be real we require $|\hat\rho_T| \le 0.5$, and it is also the case that
the true first order autocorrelation satisfies $|\rho_0| < 0.5$. Thus, given
the consistency of $\hat\rho_T$, it follows that $\Pr(|\hat\rho_T| < 0.5) \to 1$. However,
in a finite sample it is possible that $|\hat\rho_T| > 0.5$, particularly if $|\theta_0|$ is
near one. To provide an estimator of $\theta_0$ that is always real valued
we could define

$$\tilde\rho_T = \begin{cases} \hat\rho_T, & \text{if } |\hat\rho_T| \le 0.5, \\ -0.5, & \text{if } \hat\rho_T < -0.5, \\ 0.5, & \text{if } \hat\rho_T > 0.5, \end{cases}$$

and $\tilde\theta_T$ to be the quadratic solution

$$\tilde\theta_T = \frac{1 - \sqrt{1 - 4\tilde\rho_T^2}}{2\tilde\rho_T}.$$

It can be seen that the finite sample distribution of $\tilde\theta_T$ consists of a
mixture of two discrete probability masses at $\pm 1$ (which disappear
as $T \to \infty$) and a continuous distribution for $-1 < \tilde\theta_T < 1$.
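The truncation rule and the choice of root can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `ma1_moment_estimator` is hypothetical, the series is taken to have zero mean as in (6-1), and the root of the quadratic with modulus at most one is the one returned (with the limiting value zero when the truncated autocorrelation is exactly zero).

```python
import math


def ma1_moment_estimator(y):
    """Return (rho_hat, theta_tilde) for a zero-mean series y.

    rho_hat is the sample first order autocorrelation; theta_tilde is the
    real-valued root obtained after truncating rho_hat to [-0.5, 0.5].
    """
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(v * v for v in y)
    rho_hat = num / den
    # Truncate so that the quadratic always has a real solution.
    rho_tilde = max(-0.5, min(0.5, rho_hat))
    if rho_tilde == 0.0:
        # Limiting value of the root as rho_tilde tends to zero.
        return rho_hat, 0.0
    theta = (1.0 - math.sqrt(1.0 - 4.0 * rho_tilde ** 2)) / (2.0 * rho_tilde)
    return rho_hat, theta
```

A series whose sample autocorrelation lies outside [-0.5, 0.5] is mapped to the boundary, so the returned estimate is exactly 1 or -1, which is the source of the two probability masses mentioned above.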


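An estimator for $\sigma_0^2$ can then be found from the relationship

$$E(y_t^2) = \sigma_0^2(1 + \theta_0^2). \tag{6-3}$$

Using appropriate sample quantities we have

$$\hat\sigma_T^2 = \frac{T^{-1}\sum_{t=1}^{T} y_t^2}{1 + \hat\theta_T^2}.$$


We can put this estimator into the GMM framework defined in
Chapter 1. If we let the parameter vector be $\eta = (\theta, \sigma^2)'$ and define

$$f(y_t, \eta) = \begin{pmatrix} y_t y_{t-1} - \sigma^2\theta \\ y_t^2 - \sigma^2(1+\theta^2) \end{pmatrix},$$

then it follows from (6-2) and (6-3) that $E f(y_t, \eta_0) = 0$. The
sample moments are

$$f_T(\eta) = T^{-1}\sum_{t=1}^{T} f(y_t, \eta) = \begin{pmatrix} T^{-1}\sum_{t=2}^{T} y_t y_{t-1} - \sigma^2\theta \\ T^{-1}\sum_{t=1}^{T} y_t^2 - \sigma^2(1+\theta^2) \end{pmatrix},$$

and solving the exactly identified equation $f_T(\hat\eta_T) = 0$ for $\hat\eta_T$ gives
$\hat\eta_T = (\hat\theta_T, \hat\sigma_T^2)'$, the estimators defined above. With $\tilde\sigma_T^2$ defined
analogously to $\hat\sigma_T^2$ (using $\tilde\theta_T$ in place of $\hat\theta_T$), we can also define
$\tilde\eta_T = (\tilde\theta_T, \tilde\sigma_T^2)'$.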

The asymptotic properties of these estimators are summarized 


in the following theorem, the proof of which is given in the Appen- 
dix of this chapter. 


THEOREM 6.1 


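Both $\hat\eta_T$ and $\tilde\eta_T$ are consistent and both $\sqrt{T}(\hat\eta_T - \eta_0)$ and
$\sqrt{T}(\tilde\eta_T - \eta_0)$ are asymptotically normal with mean 0 and
asymptotic covariance matrix

$$\frac{1}{(1-\theta_0^2)^2}\begin{pmatrix} 1 + \theta_0^2 + 4\theta_0^4 + \theta_0^6 + \theta_0^8 & -2\sigma_0^2\theta_0^3(2 + \theta_0^2 + \theta_0^4) \\ -2\sigma_0^2\theta_0^3(2 + \theta_0^2 + \theta_0^4) & 2\sigma_0^4(1 - 2\theta_0^2 + 3\theta_0^4 + 2\theta_0^6) \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & \kappa_4 \end{pmatrix}.$$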


It is of interest to consider the asymptotic efficiency of $\hat\theta_T$ in
particular. Its asymptotic variance is

$$\frac{1 + \theta_0^2 + 4\theta_0^4 + \theta_0^6 + \theta_0^8}{(1 - \theta_0^2)^2}$$

as compared to the asymptotic variance of the maximum likelihood
estimator under the normality assumption, which is $1 - \theta_0^2$. If
$\theta_0 = 0$ then it can be seen that $\hat\theta_T$ is as asymptotically efficient
as the maximum likelihood estimator, but as $\theta_0$ departs from zero
$\hat\theta_T$ rapidly becomes less efficient than maximum likelihood. This
estimator and its generally poor properties appear in Hannan [1960],
(p. 47-48), see also Hannan [1970], (p. 373-374). This provides a
simple example of the fact that (G)MM can provide quite inefficient 
estimators, and it is important to examine the properties of any new 
estimator derived. 
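To see how quickly the efficiency loss grows, the two asymptotic variances quoted above can be tabulated. The sketch below simply evaluates those two formulas; the function names `avar_moment` and `avar_ml` are hypothetical labels chosen for this illustration.

```python
def avar_moment(theta0):
    """Asymptotic variance of the simple moment estimator of an MA(1)."""
    num = 1 + theta0**2 + 4 * theta0**4 + theta0**6 + theta0**8
    return num / (1 - theta0**2) ** 2


def avar_ml(theta0):
    """Asymptotic variance of the Gaussian maximum likelihood estimator."""
    return 1 - theta0**2


if __name__ == "__main__":
    # Ratio of the two variances over a grid of MA parameters.
    for theta0 in (0.0, 0.25, 0.5, 0.75, 0.9):
        ratio = avar_moment(theta0) / avar_ml(theta0)
        print(f"theta0 = {theta0:4.2f}   variance ratio = {ratio:10.2f}")
```

The ratio equals one at $\theta_0 = 0$ and rises sharply as $|\theta_0|$ approaches one, because the moment estimator's variance diverges while the maximum likelihood variance shrinks.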


6.1.2 Estimation by Autoregressive Approximation 


A commonly used method for the estimation of MA(1) models 
(and, more generally, ARMA models) is to use an AR model to 
approximate the time series and derive estimates of the MA(1) 
model from the estimates of the AR model. The idea was suggested 
by Durbin [1959] and many modifications and extensions have since 
been made. 


In the case of the invertible MA(1) model (6-1), we have the
AR representation

$$y_t = \sum_{j=1}^{\infty} a_j(\theta_0) y_{t-j} + \varepsilon_t \tag{6-4}$$

where

$$a_j(\theta) = -(-\theta)^j, \quad j = 1, 2, \ldots \tag{6-5}$$

for any $\theta$. We use the notation $a_j(\theta)$ to show explicitly that the
autoregressive coefficients are functions of $\theta$. Since (6-4) is of
infinite order, it is necessary in practice to approximate it by an
AR(k) model

$$y_t = \sum_{j=1}^{k} a_j(\theta_0) y_{t-j} + \varepsilon_{kt}, \tag{6-6}$$

in order to obtain any estimates. Note that the error term
$\varepsilon_{kt} = \varepsilon_t + \sum_{j=k+1}^{\infty} a_j(\theta_0) y_{t-j} = \varepsilon_t + (-1)^k \theta_0^{k+1} \varepsilon_{t-k-1}$ in (6-6)
is different from that in (6-4), reflecting the extra error made by
approximating the AR($\infty$) by an AR(k). However, since $|\theta_0| < 1$,
this error decreases as k increases. For the asymptotic results that
follow, it is necessary to assume that k is chosen in such a way that
$k \to \infty$ as $T \to \infty$, so that (6-6) approximates (6-4) increasingly
well as the sample size increases.
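

The closed form of the truncation error in (6-6) follows from a telescoping sum. A brief derivation, using only the MA(1) model (6-1) and the coefficients (6-5), is sketched here for the reader who wants the intermediate step:

```latex
\begin{align*}
\varepsilon_{kt}
  &= y_t - \sum_{j=1}^{k} a_j(\theta_0)\, y_{t-j}
   = \sum_{j=0}^{k} (-\theta_0)^j\, y_{t-j} \\
  &= \sum_{j=0}^{k} (-\theta_0)^j \varepsilon_{t-j}
     + \theta_0 \sum_{j=0}^{k} (-\theta_0)^j \varepsilon_{t-j-1} \\
  &= \sum_{j=0}^{k} (-\theta_0)^j \varepsilon_{t-j}
     - \sum_{j=1}^{k+1} (-\theta_0)^{j} \varepsilon_{t-j} \\
  &= \varepsilon_t - (-\theta_0)^{k+1} \varepsilon_{t-k-1}
   = \varepsilon_t + (-1)^{k}\, \theta_0^{k+1}\, \varepsilon_{t-k-1}.
\end{align*}
```

The extra term has variance $\theta_0^{2(k+1)}\sigma_0^2$, which shrinks geometrically in k when $|\theta_0| < 1$.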


In order to put these ideas into the GMM framework, we regard
(6-1) as a structural model and (6-6) as a reduced form model
whose role is to summarize the moments of the MA(1) process. In
particular, the structural model has implications for the second
moments of $y_t$ (i.e., the autocovariances) and the AR model captures
these moments. That is, we use an AR model to summarize the
second order properties of $y_t$ instead of (the first element of) the
autocovariance function as in the previous section. The connection
between the parameter of interest $\theta_0$ and the reduced form
parameters $a_j(\theta_0)$ is given by (6-5), and these relations can be used to
define estimators for $\theta_0$ based on estimators for $a_j(\theta_0)$.


For any $\theta$, we define a $k \times 1$ vector

$$A_k(\theta) = \begin{pmatrix} a_1(\theta) \\ \vdots \\ a_k(\theta) \end{pmatrix},$$

with $a_j(\theta)$ given by (6-5). The vector of true autoregressive
coefficients in (6-6) may therefore be written $A_k(\theta_0)$. Similarly, let $\hat A_k$
represent the $k \times 1$ vector of OLS estimators $\hat a_1, \ldots, \hat a_k$ in (6-6).
The estimation strategy is, for some value of k, to choose $\hat\theta_{Tk}$ to be
the value of $\theta$ that minimizes the difference between $\hat A_k$ and $A_k(\theta)$.
That is, we can define

$$\hat\theta_{Tk} = \arg\min_{\theta \in \Theta} \left(\hat A_k - A_k(\theta)\right)' V_{Tk} \left(\hat A_k - A_k(\theta)\right), \tag{6-7}$$

where $\Theta = (-1, 1)$ and $V_{Tk}$ is a $k \times k$ weighting matrix. For a given
k, different estimators can be constructed by using different
sub-vectors of $(\hat A_k - A_k(\theta))$, and more generally, by choosing different
weighting matrices. The following well known estimators are of the
form of (6-7).


6.1.2.1 The Durbin Estimator 


Durbin [1959] suggested an estimator that follows naturally from a
re-arranged form of (6-5). We can write (6-5) as

$$a_j(\theta) = -\theta a_{j-1}(\theta), \quad j = 1, 2, \ldots,$$

with $a_0(\theta) = -1$. This may be interpreted as an exact autoregressive
relationship for $a_j(\theta)$. Hence we could estimate $\theta_0$ by regressing
the estimates $(\hat a_1, \ldots, \hat a_k)'$ from (6-6) on a "lag" of themselves, i.e.,
$(\hat a_0, \hat a_1, \ldots, \hat a_{k-1})'$. The estimator is thus

$$\hat\theta_D = -\frac{\sum_{j=1}^{k} \hat a_j \hat a_{j-1}}{\sum_{j=1}^{k} \hat a_{j-1}^2},$$

with $\hat a_0 = -1$. Durbin [1959] showed that $\hat\theta_D$ is consistent and
asymptotically normal with asymptotic variance $(1 - \theta_0^2)$. Hence this
estimator is asymptotically as efficient as the maximum likelihood
estimator.


To express $\hat\theta_D$ in terms of (6-7), let $B_k(\theta) = I_k + \theta L_k$, where $L_k$
is a $k \times k$ matrix with ones on the first lower off-diagonal and zeros
elsewhere. Then $\hat\theta_D$ results from (6-7) when $V_{Tk} = B_k(\theta)'B_k(\theta)$.
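

The regression of the autoregressive estimates on their own lag can be sketched as follows. This is a minimal illustration, not Durbin's original code: the function names are hypothetical, the convention $\hat a_0 = -1$ is taken from the recursion stated above, and the AR(k) estimates are assumed to have been obtained beforehand by OLS.

```python
def ar_coefficients(theta, k):
    """Exact AR coefficients a_j(theta) = -(-theta)**j for j = 1..k, as in (6-5)."""
    return [-((-theta) ** j) for j in range(1, k + 1)]


def durbin_estimator(a_hat):
    """Regress a_hat_j on a_hat_{j-1} (with a_hat_0 = -1) and change the sign.

    a_hat is the list [a_hat_1, ..., a_hat_k] of estimated AR(k) coefficients.
    """
    lagged = [-1.0] + list(a_hat[:-1])
    num = sum(a * b for a, b in zip(a_hat, lagged))
    den = sum(b * b for b in lagged)
    return -num / den
```

When the exact coefficients from (6-5) are supplied in place of estimates, the recursion holds without error and the function returns $\theta$ itself, which gives a quick check of the sign convention.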


6.1.2.2 The Galbraith and Zinde-Walsh Estimator 


Galbraith and Zinde-Walsh [1994] made use of autoregressive
approximations to estimate MA models of any order. For the first
order case, their approach is to make use of relationship (6-5) only
for j = 1. That is, they choose $\hat\theta_{GZW}$ to minimize the difference
between $\hat a_1$ and $a_1(\theta) = \theta$, which results in the estimator

$$\hat\theta_{GZW} = \hat a_1.$$

In terms of (6-7), this implies that $V_{Tk}$ is chosen to be a $k \times k$
matrix of zeros except for element (1,1), which is set to one. This
results in a very simple estimator, but one which is not always
asymptotically efficient. It is shown by Galbraith and Zinde-Walsh
[1994] that the asymptotic variance of $\hat\theta_{GZW}$ is one for any $\theta_0$. Thus
$\hat\theta_{GZW}$ is asymptotically efficient for $\theta_0 = 0$, but loses efficiency
relative to $\hat\theta_D$ and maximum likelihood as $\theta_0$ departs from zero.
The loss of asymptotic efficiency relative to $\hat\theta_D$ may have been
expected because only one of the k possible moment conditions is used
in constructing $\hat\theta_{GZW}$.


6.1.3 A Finite Order Autoregressive Approximation 


The equations (6-5) apply to the infinite order autoregressive
representation (6-4). In practice we must use the finite order
autoregressive approximation (6-6), and rely on the assumption that $k \to \infty$
as $T \to \infty$ to claim that (6-5) will provide an accurate relationship
between the autoregressive and moving average parameters. An
alternative approach, suggested by Ghysels, Khalaf and Vodounou
[1994], is to replace (6-5) by equations derived by Galbraith and
Zinde-Walsh [1994], (equation 10) assuming a fixed and finite lag
length k. That is, we use

$$a_j(\theta) = -(-\theta)^j \, \frac{1 - \theta^{2(k-j+1)}}{1 - \theta^{2(k+1)}} \tag{6-8}$$

to construct $A_k(\theta)$, and find $\hat\theta_{Tk}$ from (6-7). Clearly (6-8)
converges to (6-5) as k goes to infinity. It is possible that the
inclusion of the additional terms in (6-8) that are functions of k will
improve the finite sample performance of the estimator by explicitly
acknowledging the finite order autoregression used. A disadvantage
of the approach is additional computational complexity compared
to the Galbraith and Zinde-Walsh and Durbin estimators above. If
we follow the Galbraith and Zinde-Walsh approach and estimate
$\theta_0$ using only the first autoregressive coefficient, then the sample
version of (6-8) with j = 1 rearranges to give

$$\hat a_1 \hat\theta^{2(k+1)} - \hat\theta^{2k+1} + \hat\theta - \hat a_1 = 0.$$

That is, it is necessary to solve a 2(k + 1) order polynomial to
obtain the estimator, making the approach less convenient than
those above.
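

In practice the polynomial need not be solved in closed form. One possible numerical route, sketched below, is to invert the j = 1 case of (6-8) by bisection on (-1, 1). The function names are hypothetical, and the sketch assumes that the map from $\theta$ to $a_1(\theta)$ is increasing on that interval, which is an assumption of this illustration rather than a result stated in the text.

```python
def a1_finite(theta, k):
    """First AR(k) coefficient implied by (6-8) with j = 1."""
    return theta * (1 - theta ** (2 * k)) / (1 - theta ** (2 * (k + 1)))


def solve_theta(a1_hat, k, tol=1e-12):
    """Bisection for theta in (-1, 1) such that a1_finite(theta, k) = a1_hat.

    Assumes a1_finite is increasing in theta on (-1, 1).
    """
    lo, hi = -1 + 1e-9, 1 - 1e-9
    if not a1_finite(lo, k) < a1_hat < a1_finite(hi, k):
        raise ValueError("a1_hat is outside the attainable range for this k")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if a1_finite(mid, k) < a1_hat:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

The range check matters: for a fixed k the attainable values of $a_1$ lie strictly inside (-1, 1), so an OLS estimate outside that range has no corresponding $\theta$ and has to be handled separately.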


6.1.4 Higher Order MA Models 


The first order moving average estimators of the previous sections 
can be extended to higher order MA models with little conceptual 
difficulty. Given the poor properties of the estimator in Section 
6.1.1 we only consider estimators based on autoregressive approxi- 
mations. 


We consider the MA(m) model

$$y_t = \varepsilon_t + \theta_{10}\varepsilon_{t-1} + \ldots + \theta_{m0}\varepsilon_{t-m},$$

and its infinite autoregressive representation (6-4) where $\theta_0 = (\theta_{10}, \ldots, \theta_{m0})'$
and the relationships between $a_j(\theta)$ and $\theta$ are given
by (see Galbraith and Zinde-Walsh [1994])

$$a_j(\theta) = -\sum_{i=1}^{\min(j,m)} \theta_i a_{j-i}(\theta), \quad j = 1, 2, \ldots, \tag{6-9}$$

with $a_0(\theta) = -1$ as before. We again make use of the approximate AR(k) model (6-6) with
$k \to \infty$ as $T \to \infty$. It is necessary that k be at least as large as m
in any application in order to identify $\theta_0$.


The Galbraith and Zinde-Walsh [1994] approach takes the first
m equations in (6-9), replaces the parameters with sample estimators
and solves for the $m \times 1$ vector $\hat\theta_{GZW}$. For a given k, it is
straightforward to obtain formulae for the elements of $\hat\theta_{GZW}$. For
example, for an MA(3) model we have

$$\hat\theta_{GZW,1} = \hat a_1,$$

$$\hat\theta_{GZW,2} = \hat a_2 + \hat a_1^2,$$

$$\hat\theta_{GZW,3} = \hat a_3 + 2\hat a_1 \hat a_2 + \hat a_1^3.$$

In general, $\hat\theta_{GZW}$ is not asymptotically efficient because it uses only
a fixed number of an expanding number of moment conditions.


The Durbin [1959] approach uses all of $\hat a_1, \ldots, \hat a_k$. Sample
counterparts to the equations (6-9) are available for $j = 1, \ldots, k$.
These equations imply the m order autoregression in $\hat a_j$

$$\hat a_j = -\theta_1 \hat a_{j-1} - \ldots - \theta_m \hat a_{j-m}$$

estimated by OLS over $j = 1, \ldots, k$ with $\hat a_j = 0$ for $j < 0$ (and $\hat a_0 = -1$
as before). The resulting coefficient vector $\hat\theta_D = (\hat\theta_{D1}, \ldots, \hat\theta_{Dm})'$ is an asymptotically
efficient estimator of $\theta_0$.
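

The first m equations of (6-9) are triangular in $\theta$, so the Galbraith and Zinde-Walsh estimates can be computed recursively rather than from explicit formulae. The following sketch assumes the convention $a_0(\theta) = -1$ used for the MA(1) case; the function name `gzw_ma_estimates` is hypothetical.

```python
def gzw_ma_estimates(a_hat, m):
    """Solve the first m equations of (6-9) for theta_1, ..., theta_m.

    a_hat is the list [a_hat_1, ..., a_hat_k] of AR(k) estimates, k >= m.
    With a_0 = -1, equation j reads
        a_j = theta_j - sum_{i < j} theta_i * a_{j-i},
    which is solved for theta_j given theta_1, ..., theta_{j-1}.
    """
    if len(a_hat) < m:
        raise ValueError("need at least m autoregressive estimates (k >= m)")
    theta = []
    for j in range(1, m + 1):
        correction = sum(theta[i - 1] * a_hat[j - i - 1] for i in range(1, j))
        theta.append(a_hat[j - 1] + correction)
    return theta
```

For m = 3 the recursion reproduces the three closed-form expressions given above for the MA(3) model.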


6.2 Estimation of ARMA Models 


We now consider the estimation of ARMA models. We first dis- 
cuss the estimation of the AR coefficients of an ARMA model by 
instrumental variables methods. These estimators clearly fall into 
the class of GMM estimators. 


6.2.1 IV Estimation of AR Coefficients 


In some applications, the estimation of the AR coefficients of an 
ARMA model may be of primary interest, and the MA coefficients 
may be nuisance parameters. For example, suppose a first order 
autoregressive signal is observed with an additive white noise mea- 
surement error: 

$$y_t = \xi_t + v_t,$$

$$\xi_t = \alpha \xi_{t-1} + u_t,$$

where $u_t$ and $v_t$ are independent i.i.d. processes. These equations
can be combined to give

$$y_t - \alpha y_{t-1} = u_t + v_t - \alpha v_{t-1},$$

which is equivalent in its moments to an ARMA(1,1) model with
the MA coefficient being a function of $\alpha$ and the variances of $u_t$
and $v_t$. In this case we are only interested in the autoregressive
parameter $\alpha$ and the moving average parameter is a nuisance
parameter reflecting the presence of the measurement error. This type
of application arises in engineering system identification, see Stoica, 
Soderstrom and Friedlander [1985] for example. Useful application 
of the following methods can also be made in unit root testing as 
discussed in Section 6.3 and also as part of a method of estimating 
the full ARMA model as discussed in Section 6.2.3. 


6.2.2 The ARMA(1,1) Model 


For clarity, we treat the ARMA(1,1) model in some detail. The 
model is 

$$y_t = \beta_0 y_{t-1} + \varepsilon_t + \theta_0 \varepsilon_{t-1}, \tag{6-10}$$

and we assume that $\beta_0 \ne -\theta_0$ and that $y_t$ is stationary and
invertible (that is $|\beta_0| < 1$ and $|\theta_0| < 1$). For now, we are concerned only
with the estimation of $\beta_0$. We view the model as a regression

$$y_t = \beta_0 y_{t-1} + u_t, \tag{6-11}$$

where $u_t$ is the autocorrelated error term

$$u_t = \varepsilon_t + \theta_0 \varepsilon_{t-1}.$$

Note that the OLS estimator of $\beta_0$ in (6-11) (ignoring the moving
average structure of $u_t$) is inconsistent because $E(u_t y_{t-1}) \ne 0$. That
is, the regressor is correlated with the error term. This can be seen
by noting that

$$y_t = \sum_{j=0}^{\infty} \beta_0^j u_{t-j} \tag{6-12}$$

by back substitution in (6-11). Then $E(u_t y_{t-1}) = E(u_t u_{t-1}) = \sigma_0^2\theta_0$.
Therefore, we consider the instrumental variables estimation
of (6-11).


6.2.2.1 A Simple IV Estimator


Note that the MA(1) structure of $u_t$ implies that $E(u_t u_{t-j}) = 0$ for
all $j \ge 2$. Furthermore, (6-12) implies that

$$E(u_t y_{t-j}) = 0 \quad \text{for all } j \ge 2. \tag{6-13}$$

These moment conditions can be used to identify and consistently
estimate $\beta_0$. Just one of these moment conditions is sufficient to
identify $\beta_0$, and the natural one to choose is $E(u_t y_{t-2}) = 0$. We
then have the moment conditions $E f(y_t, \beta_0) = 0$, where

$$f(y_t, \beta) = (y_t - \beta y_{t-1}) y_{t-2}.$$

Then $f_T(\beta) = T^{-1}\sum_{t=3}^{T} (y_t - \beta y_{t-1}) y_{t-2}$ and solving $f_T(\hat\beta_T) = 0$ for
$\hat\beta_T$ yields

$$\hat\beta_T = \left(\sum_{t=3}^{T} y_{t-2} y_{t-1}\right)^{-1} \sum_{t=3}^{T} y_{t-2} y_t.$$

Of course, this is just the simple IV estimator of (6-11) using $y_{t-2}$
as an instrumental variable for $y_{t-1}$.
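

The contrast between OLS and the simple IV estimator can be illustrated by simulation. The sketch below is an assumption-laden example rather than part of the original analysis: the function names are hypothetical, the errors are drawn as standard normal, and the series is started from zero without a burn-in period.

```python
import random


def simulate_arma11(beta, theta, n, seed=0):
    """Simulate n observations of y_t = beta*y_{t-1} + e_t + theta*e_{t-1}."""
    rng = random.Random(seed)
    eps_prev, y_prev, out = 0.0, 0.0, []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        y = beta * y_prev + eps + theta * eps_prev
        out.append(y)
        eps_prev, y_prev = eps, y
    return out


def ols_ar1(y):
    """OLS slope of y_t on y_{t-1}; inconsistent when the error is MA(1)."""
    num = sum(y[t - 1] * y[t] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    return num / den


def simple_iv(y):
    """IV estimate of beta_0 in (6-11) using y_{t-2} as instrument for y_{t-1}."""
    num = sum(y[t - 2] * y[t] for t in range(2, len(y)))
    den = sum(y[t - 2] * y[t - 1] for t in range(2, len(y)))
    return num / den


if __name__ == "__main__":
    y = simulate_arma11(beta=0.5, theta=0.5, n=20000)
    print("OLS estimate:", round(ols_ar1(y), 3))
    print("IV estimate: ", round(simple_iv(y), 3))
```

With a positive MA coefficient the OLS slope settles above the true autoregressive parameter, while the IV estimate is centred on it, in line with the moment condition (6-13).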


We have the following asymptotic properties. 


THEOREM 6.2 


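$\hat\beta_T$ is consistent and $\sqrt{T}(\hat\beta_T - \beta_0)$ is asymptotically
normally distributed with mean 0 and variance

$$\frac{1 - \beta_0^2}{(1 + \beta_0\theta_0)^2(\beta_0 + \theta_0)^2}\left(1 + 4\theta_0^2 + 4\beta_0\theta_0 + 4\beta_0\theta_0^3 + 2\beta_0^2\theta_0^2 + \theta_0^4\right).$$


Notice that the asymptotic variance of $\hat\beta_T$ becomes infinite when
$\beta_0 = -\theta_0$. If $\beta_0 = -\theta_0$ then $y_t$ is equivalent to a white noise process
and both $\beta_0$ and $\theta_0$ are not identified. This possibility is excluded
by our assumptions, but the asymptotic distribution suggests that
if $\beta_0$ is near $-\theta_0$ then $\hat\beta_T$ may have a large variance.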


For the purposes of comparison, the asymptotic distribution of
the maximum likelihood estimator of the autoregressive parameter
in an ARMA(1,1) model with normally distributed errors is

$$\sqrt{T}(\hat\beta_{MLE} - \beta_0) \xrightarrow{d} N\left(0, \frac{(1 + \beta_0\theta_0)^2(1 - \beta_0^2)}{(\beta_0 + \theta_0)^2}\right),$$

(see Brockwell and Davis [1993], (p. 260) for example). As would be
expected, the asymptotic variance of $\hat\beta_{MLE}$ is smaller than that of
$\hat\beta_T$. For values of $\beta_0$ and $\theta_0$ near zero, $\hat\beta_T$ is almost as asymptotically
efficient as $\hat\beta_{MLE}$. However, for larger values of $\beta_0$ and $\theta_0$, $\hat\beta_{MLE}$ is
considerably more asymptotically efficient (especially if $\beta_0$ is close
to $-\theta_0$ and close to one in magnitude). Note that $\hat\beta_{MLE}$ may also
have a large variance if $\beta_0$ is close to $-\theta_0$.
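

The size of the efficiency gap can be checked numerically from the two variance formulas. The following sketch only evaluates the expressions in Theorem 6.2 and in the maximum likelihood result quoted above; the function names `avar_iv` and `avar_mle` are hypothetical.

```python
def avar_iv(beta0, theta0):
    """Asymptotic variance of the simple IV estimator (Theorem 6.2)."""
    bracket = (1 + 4 * theta0**2 + 4 * beta0 * theta0
               + 4 * beta0 * theta0**3 + 2 * beta0**2 * theta0**2 + theta0**4)
    scale = (1 - beta0**2) / ((1 + beta0 * theta0) ** 2 * (beta0 + theta0) ** 2)
    return scale * bracket


def avar_mle(beta0, theta0):
    """Asymptotic variance of the Gaussian MLE of the AR parameter."""
    return (1 + beta0 * theta0) ** 2 * (1 - beta0**2) / (beta0 + theta0) ** 2


if __name__ == "__main__":
    for beta0, theta0 in ((0.5, 0.0), (0.2, 0.1), (0.7, 0.3), (0.9, -0.8)):
        ratio = avar_iv(beta0, theta0) / avar_mle(beta0, theta0)
        print(f"beta0 = {beta0:5.2f}  theta0 = {theta0:5.2f}  IV/MLE = {ratio:8.2f}")
```

The ratio is one when $\theta_0 = 0$, stays close to one for small parameter values, and becomes very large when $\beta_0$ is near $-\theta_0$ and near one in magnitude.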


6.2.2.2 Optimal IV Estimation 


In the preceding section we used $y_{t-2}$ as an instrument for $y_{t-1}$.
It is possible to use any one of $y_{t-j}$, $j = 2, 3, \ldots$ as an admissible
instrument for $y_{t-1}$, and to construct estimators

$$\hat\beta_{Tj} = \left(\sum_{t=j+1}^{T} y_{t-1} y_{t-j}\right)^{-1} \sum_{t=j+1}^{T} y_t y_{t-j}$$

for various values of $j \ge 2$. However, it is shown by Dolado
[1990] that $\hat\beta_{Tj}$ has asymptotic variance equal to $\beta_0^{-2(j-2)}$ times the
asymptotic variance given in Theorem 6.2 above. Since $|\beta_0| < 1$ by
assumption, it follows that $\hat\beta_{T2}$ is the most asymptotically efficient
of all of the estimators $\hat\beta_{Tj}$. Intuitively, this result is expected. In
general, the asymptotic efficiency of an IV estimator is positively
related to the correlation between the stochastic regressor ($y_{t-1}$
in this case) and its instrumental variable ($y_{t-j}$). Since $y_t$ is a
stationary ARMA process, it follows that the correlation between
$y_{t-1}$ and $y_{t-j}$ decreases as j increases, so it is preferable to choose
j as small as is admissible, that is j = 2.
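

The factor reported by Dolado [1990] can be motivated by a short calculation. The sketch below assumes the i.i.d. error structure of (6-10) and writes $\gamma_h = E(y_t y_{t-h})$ for the autocovariances and $S$ for the long-run variance of $y_{t-j}u_t$; both symbols are introduced here for illustration only and are not notation from the original text.

```latex
\begin{align*}
\operatorname{avar}(\hat\beta_{Tj})
  &= \frac{S}{\gamma_{j-1}^{2}}, \qquad
  S = \sigma_0^2\left[(1+\theta_0^2)\gamma_0 + 2\theta_0\gamma_1\right]
  \quad\text{for every } j \ge 2, \\
\gamma_{j-1} &= \beta_0^{\,j-2}\,\gamma_1 \qquad (j \ge 2), \\
\operatorname{avar}(\hat\beta_{Tj})
  &= \beta_0^{-2(j-2)}\,\frac{S}{\gamma_1^{2}}
   = \beta_0^{-2(j-2)}\operatorname{avar}(\hat\beta_{T2}).
\end{align*}
```

The numerator does not depend on j because $u_t$ is MA(1) and is uncorrelated with $y_{t-j}$ for $j \ge 2$, so only the strength of the instrument, measured by $\gamma_{j-1}$, changes with the lag.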


We can also consider estimators that make use of more than
one instrument for $y_{t-1}$ implied by the moment conditions (6-13).
This was considered for the estimation of ARMA models of any
order by Stoica, Soderstrom and Friedlander [1985]. Suppose we
make use of the first q consecutive moment conditions implied by
(6-13):

$$E f(y_t, \beta_0) = 0,$$

where now

$$f(y_t, \beta) = Y_{q,t-2}(y_t - \beta y_{t-1})$$

is a $q \times 1$ vector of moments and $Y_{q,t} = (y_t, \ldots, y_{t-q+1})'$ is a $q \times 1$
vector of instrumental variables. Following Definition 1.2, we define

$$\hat\beta_{Tq} = \arg\min_{\beta} Q_T(\beta),$$

where

$$Q_T(\beta) = f_T(\beta)' A_{Tq} f_T(\beta),$$

and $A_{Tq}$ is a positive definite $O_p(1)$ weighting matrix. This leads
to

$$\hat\beta_{Tq} = \left(\sum_{t=q+2}^{T} y_{t-1} Y_{q,t-2}' \, A_{Tq} \sum_{t=q+2}^{T} Y_{q,t-2} y_{t-1}\right)^{-1} \sum_{t=q+2}^{T} y_{t-1} Y_{q,t-2}' \, A_{Tq} \sum_{t=q+2}^{T} Y_{q,t-2} y_t.$$

The admissibility of $Y_{q,t-2}$ as a set of instruments for $y_{t-1}$
guarantees the consistency of $\hat\beta_{Tq}$. In order to give the asymptotic
distribution of $\hat\beta_{Tq}$, we define the $q \times 1$ vector $R_q = E(Y_{q,t-2} y_{t-1})$.
The $j$th element of $R_q$ is

$$\frac{\sigma_0^2 \beta_0^{j-1}(1 + \beta_0\theta_0)(\beta_0 + \theta_0)}{1 - \beta_0^2}$$

and $T^{-1}\sum_{t=q+2}^{T} Y_{q,t-2} y_{t-1} \xrightarrow{p} R_q$. The asymptotic distribution of
$\hat\beta_{Tq}$ is shown by Stoica, Soderstrom and Friedlander [1983] to be

$$\sqrt{T}(\hat\beta_{Tq} - \beta_0) \xrightarrow{d} N\left(0, \sigma_0^2 (R_q' A_q R_q)^{-1} R_q' A_q V_q A_q R_q (R_q' A_q R_q)^{-1}\right),$$

where $A_q$ denotes the probability limit of $A_{Tq}$ and

$$V_q = \lim_{T \to \infty} T \operatorname{var} f_T(\beta_0).$$

The asymptotic efficiency of $\hat\beta_{Tq}$ is a function of the choice of the
number of instruments (q) and the weighting matrix $A_{Tq}$. The
optimal choice of $A_{Tq}$ for a given q is such that $A_{Tq} \xrightarrow{p} V_q^{-1}$. Then

$$\sqrt{T}(\hat\beta_{Tq} - \beta_0) \xrightarrow{d} N\left(0, \sigma_0^2 (R_q' V_q^{-1} R_q)^{-1}\right).$$
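

The multiple-instrument estimator is simple to compute once a weighting matrix is fixed. The sketch below uses the identity matrix for the weighting matrix, which is an assumption made purely to keep the example short; it is not the optimal choice described above, and the function name `multi_iv` is hypothetical.

```python
def multi_iv(y, q):
    """IV estimate of beta_0 using the q instruments y_{t-2}, ..., y_{t-q-1}.

    Identity weighting is assumed, so the estimator reduces to
    (s_x' s_x)^{-1} s_x' s_y, where s_x and s_y are the sums of the
    instrument vector times y_{t-1} and y_t respectively.
    """
    s_x = [0.0] * q
    s_y = [0.0] * q
    for t in range(q + 1, len(y)):
        for i in range(q):
            z = y[t - 2 - i]          # (i+1)-th element of the instrument vector
            s_x[i] += z * y[t - 1]
            s_y[i] += z * y[t]
    num = sum(a * b for a, b in zip(s_x, s_y))
    den = sum(a * a for a in s_x)
    return num / den
```

With q = 1 the function reduces to the simple IV estimator of the previous section. Replacing the identity matrix by an estimate of the inverse of the moment covariance matrix would be needed to approach the efficiency bound.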


Stoica, Soderstrom and Friedlander also show that the term $(R_q' V_q^{-1} R_q)^{-1}$
is a monotonically decreasing function of q and converges to
a finite non-zero limit, and hence maximum asymptotic efficiency
is achieved by letting $q \to \infty$ as $T \to \infty$.


6.3 Applications to Unit Root Testing


Instrumental variables estimation methods have been used in the 
construction of tests for autoregressive unit roots, and we show that 
they can also be applied in the construction of tests for moving 
average unit roots. 


6.3.1 Testing for an Autoregressive Unit Root 


Pantula and Hall [1991] consider the problem of testing for an 
autoregressive unit root in an ARMA model of known order. Said 
and Dickey [1985] suggested a test based on nonlinear least squares 
estimation of the model, so the instrumental variables approach 
provides a computationally simpler approach that does not require 
numerical optimisation. Other popular methods for testing for au- 
toregressive unit roots such as those of Said and Dickey [1984] and 
Phillips [1987] use semi-parametric approximations to the ARMA
model and therefore do not exploit the knowledge of the order of
the ARMA model. Of course, in practice the model order is the
result of a model selection procedure and the approximate nature
of the semi-parametric approaches may be less of a disadvantage.


The model we consider is

$$\Delta y_t = \alpha y_{t-1} + \sum_{j=1}^{m} \beta_j \Delta y_{t-j} + u_t, \tag{6-14}$$

where $u_t$ is a moving average error of order n:

$$u_t = \varepsilon_t + \sum_{j=1}^{n} \theta_j \varepsilon_{t-j}. \tag{6-15}$$

We assume that the autoregressive lag polynomial $\beta(L) = 1 - \sum_{j=1}^{m} \beta_j L^j$
is stationary and that the moving average lag polynomial
$\theta(L) = 1 + \sum_{j=1}^{n} \theta_j L^j$ is invertible. The aim is to test the null
hypothesis that $\alpha = 0$, in which case $y_t$ follows an ARIMA(m, 1, n)
model. The alternative hypothesis is that $\alpha < 0$, which implies that
$y_t$ follows a stationary ARIMA(m + 1, 0, n) model. The estimation
theory for (6-14) is given under the null hypothesis so that the null
distributions of tests of $\alpha = 0$ can be found.


If we were to estimate (6-14) by OLS ignoring the MA structure
of $u_t$, then the asymptotic distribution of the resulting $\hat\alpha$ would
be a function of nuisance parameters reflecting the autocorrelation
in the error term and the correlation between the regressors and
error term (see Phillips [1987] for example). Hall [1989] in the
pure MA case (m = 0) and Pantula and Hall [1991] in the ARMA
model given above suggest using instrumental variables for $y_{t-1}$ and
$X_{t-1} = (\Delta y_{t-1}, \ldots, \Delta y_{t-m})'$. They suggest using $y_{t-k}$ and $X_{t-k}$ as
instruments for $y_{t-1}$ and $X_{t-1}$ respectively. These are admissible
instruments if $k > n$. Standard instrumental variables regression
on

$$\Delta y_t = \alpha y_{t-1} + X_{t-1}'\beta + u_t$$

using these instruments gives

$$\begin{pmatrix} \hat\alpha_k \\ \hat\beta_k \end{pmatrix} = \begin{pmatrix} \sum y_{t-k} y_{t-1} & \sum y_{t-k} X_{t-1}' \\ \sum X_{t-k} y_{t-1} & \sum X_{t-k} X_{t-1}' \end{pmatrix}^{-1} \begin{pmatrix} \sum y_{t-k} \Delta y_t \\ \sum X_{t-k} \Delta y_t \end{pmatrix}.$$

Tests of $\alpha = 0$ are based on $\hat\alpha_k$, and the asymptotic properties
derived by Pantula and Hall [1991] are given in Theorem 6.3 below. To
present these results we require some notation. Let $W(r)$ denote
a standard Brownian motion defined on $0 \le r \le 1$, and, following
Pantula and Hall [1991], define the random variables

$$\Gamma = \int_0^1 W(r)^2 \, dr,$$

and

$$\xi = \frac{1}{2}\left(W(1)^2 - 1\right).$$

Also, let $B = 1 - \sum_{j=1}^{m} \beta_j$ denote the sum of the autoregressive
coefficients in (6-14), and let $A_k = E(X_{t-k} X_{t-1}')$.


THEOREM 6.3 
If $\alpha = 0$ in (6-14), $k > n$ and $A_k$ is non-singular then

$$T\hat\alpha_k \xrightarrow{d} B\,\Gamma^{-1}\xi,$$

and

$$\hat\beta_k \xrightarrow{p} \beta.$$


The consistency of $\hat\beta_k$ follows from the admissibility of the
instruments ($k > n$). Since $\beta$ is a nuisance parameter vector when
testing $\alpha$, it follows that a consistent estimator is all that will be
required. However, it is possible that more efficient estimators of
$\beta$ derived from larger sets of instruments for $X_{t-1}$ (for example,
$Z_{l,t-k} = (\Delta y_{t-k}, \ldots, \Delta y_{t-k-l+1})'$ for some $l > m$) may result in
improved finite sample properties for $\hat\alpha_k$.


It is the asymptotic distribution of $\hat\alpha_k$ that is of interest. Note
that the random variable $\Gamma^{-1}\xi$ has the Dickey-Fuller distribution
first given by Fuller [1976] for the case where $u_t$ is i.i.d. The sum
of the autoregressive coefficients (B) enters this distribution as a
nuisance parameter. However, using the consistent estimator $\hat\beta_k$,
B can easily be removed from the asymptotic distribution to give
a coefficient test of $\alpha = 0$. Define $\hat B = 1 - \sum_{j=1}^{m} \hat\beta_{kj}$. Then the
test statistic $T\hat\alpha_k / \hat B$ converges in distribution to $\Gamma^{-1}\xi$ when $\alpha = 0$,
and hence critical values can be obtained from the usual tables, see
Table 10.A.1 of Fuller [1996].
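

For the pure MA case (m = 0) considered by Hall [1989] there are no lagged differences, so B = 1 and the coefficient statistic is simply T times the IV estimate. A minimal sketch of that special case follows; the function name is hypothetical, the sample length is used for T, and the choice of k (which must exceed the MA order n) is left to the user.

```python
def iv_unit_root_coefficient(y, k):
    """Return (alpha_hat_k, T * alpha_hat_k) for the pure MA case m = 0.

    alpha_hat_k is the IV estimate of alpha in
        Delta y_t = alpha * y_{t-1} + u_t
    using y_{t-k} as the instrument for y_{t-1}.
    """
    num = sum(y[t - k] * (y[t] - y[t - 1]) for t in range(k, len(y)))
    den = sum(y[t - k] * y[t - 1] for t in range(k, len(y)))
    alpha_hat = num / den
    return alpha_hat, len(y) * alpha_hat
```

The second returned value would be compared with the lower tail of the tabulated coefficient distribution. For m > 0 the lagged differences and their instruments must be added and the statistic divided by the estimate of B, which requires a small linear system solve that is omitted here.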


It is also possible to construct a t statistic for testing $\alpha = 0$. A
t statistic can be defined as

$$t_k = \hat\sigma_k^{-1}\left(\sum_{t=k+1}^{T} y_{t-k} y_{t-1}\right)^{1/2} \hat\alpha_k,$$

where

$$\hat\sigma_k^2 = T^{-1}\sum_{t=k+1}^{T} \hat u_{kt}^2,$$

and

$$\hat u_{kt} = \Delta y_t - \hat\alpha_k y_{t-1} - \sum_{j=1}^{m} \hat\beta_{kj} \Delta y_{t-j}.$$

The asymptotic distribution of $t_k$ is not free of nuisance parameters
because it ignores the autocorrelation in $u_t$. The statistic must be
modified using the long run variance of $u_t$ which, given its MA(n)
structure, is given by

$$\omega^2 = \sigma^2\left(1 + \sum_{j=1}^{n} \theta_j\right)^2. \tag{6-16}$$

A consistent estimator of $\omega^2$ can easily be constructed if we can
replace $\sigma^2$ and $\theta_j$, $j = 1, \ldots, n$ in (6-16) by consistent estimators.
Pantula and Hall [1991] do not take this option because consistent
estimation of $\theta_j$ requires an inconvenient nonlinear least squares
estimation. However, we can make use of the more convenient moving
average estimators in Section 6.1 to obtain estimators. Either the
Galbraith and Zinde-Walsh [1994] or the Durbin [1959] methods
can be applied to $\hat u_{kt}$ to obtain $\hat\theta_{kj}$ and $\hat\sigma^2$ and hence $\hat\omega^2$. It can
then be shown that when $\alpha = 0$

$$\tilde t_k = \frac{\hat\sigma_k}{\hat\omega}\, t_k \xrightarrow{d} \Gamma^{-1/2}\xi.$$

That is, $\tilde t_k$ has the same asymptotic null distribution as a Said-Dickey
or Phillips-Perron test and the usual critical values can be
used, see Table 10.A.2 of Fuller [1996] for example.


One difficulty with implementing a test based on $\tilde t_k$ is that
$\sum_{t=k+1}^{T} y_{t-k} y_{t-1}$ is not necessarily positive, and hence its square
root may not be real. Li [1995] suggests replacing this term with the
asymptotically equivalent term $\sum_{t=2}^{T} y_{t-1}^2$, which is always positive.
An alternative, based on the IV formulae in Chapter 1, is to replace

$$\sum_{t=k+1}^{T} y_{t-k} y_{t-1} \quad \text{with} \quad \left(\sum_{t=k+1}^{T} y_{t-k} y_{t-1}\right)^{2} \left(\sum_{t=2}^{T} y_{t-1}^2\right)^{-1},$$

which is also asymptotically equivalent.


6.3.2 Testing for a Moving Average Unit Root 


We can also consider the problem of testing for a unit root in the 
moving average lag polynomial of a time series. Such a test can 
also be used as a test of the null hypothesis that a time series is 
stationary against the alternative that it has an autoregressive unit 
root. Many papers have been written on this testing problem. We 
consider here the ARIMA(m, 1, 1) model of Leybourne and McCabe 
[1994] 


$$\Delta y_t = x_t,$$

$$x_t = \sum_{j=1}^{m} \beta_j x_{t-j} + u_t,$$

$$u_t = \varepsilon_t + \theta \varepsilon_{t-1},$$

where $\beta(L) = 1 - \sum_{j=1}^{m} \beta_j L^j$ is a stationary autoregressive lag
polynomial and $|\theta| \le 1$. We are interested in testing the null
hypothesis that $\theta = 1$. If $\theta = 1$ then $x_t$ has a moving average unit
root, and it follows that $y_t$ is I(0). If $|\theta| < 1$ then $x_t$ is a stationary
and invertible ARMA(m, 1) process and hence $y_t$ is I(1). Therefore
a test of $\theta = 1$ is a test that $y_t$ has been over-differenced.


The test procedure proposed by Leybourne and McCabe [1994]
involves the estimation of $\beta_j$, $j = 1, \ldots, m$ and construction of the
filtered series

$$y_t^* = y_t - \sum_{j=1}^{m} \hat\beta_j y_{t-j}.$$

This filtered series is regressed on an intercept and time trend (if
required) to obtain residuals $e_t$. The test statistic is then

$$s = \frac{\sum_{t=1}^{T}\left(\sum_{j=1}^{t} e_j\right)^2}{T \sum_{t=1}^{T} e_t^2}.$$

The asymptotic null distribution of this test statistic has the general
form

$$s \xrightarrow{d} \int_0^1 U(r)^2 \, dr,$$

where $U(r)$ is a function of the standard Brownian motion $W(r)$
that varies depending on the form of de-meaning or de-trending
carried out on $y_t^*$. For example, if $y_t^*$ is de-meaned then $U(r) =
W(r) - rW(1)$ is a Brownian bridge. Critical values for these tests
are given in Kwiatkowski et al. [1992].
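

The partial-sum form of the statistic is easy to compute once the filtered series is available. The sketch below covers only the de-meaned case (regression on an intercept) and takes the filtered series as given; the function name `stationarity_statistic` is hypothetical, and the normalisation follows the ratio of the sum of squared partial sums to T times the residual sum of squares.

```python
def stationarity_statistic(y_star):
    """Partial-sum statistic from residuals of y_star on an intercept only."""
    n = len(y_star)
    mean = sum(y_star) / n
    resid = [v - mean for v in y_star]
    partial = 0.0
    sum_sq_partial = 0.0
    for e in resid:
        partial += e                      # running sum of residuals
        sum_sq_partial += partial * partial
    return sum_sq_partial / (n * sum(e * e for e in resid))
```

Large values of the statistic point away from the null of a moving average unit root, since the partial sums of the residuals then wander like an integrated process. Adding a time trend would require replacing the de-meaning step with a regression on an intercept and trend.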


Our interest in this testing procedure lies in the estimation of 
$\beta_j$, $j = 1,\ldots,m$. Leybourne and McCabe [1994] suggest maximum 
likelihood estimation of the ARIMA(m, 1, 1) model 

$$\Delta y_t = \sum_{j=1}^{m} \beta_j \Delta y_{t-j} + \varepsilon_t - \theta \varepsilon_{t-1} \tag{6-17}$$

to obtain $\hat{\beta}_j$. However, we do not require an estimate of $\theta$ to 
construct the test statistic and we only require consistent estimation 
of $\beta_j$ for the asymptotic theory to hold. Therefore, for convenience, 
we could use the instrumental variables estimators of Section 6.2 
to estimate (6-17). That is, denoting $X_{t-1} = (\Delta y_{t-1}, \ldots, \Delta y_{t-m})'$ 
and $\beta = (\beta_1, \ldots, \beta_m)'$, we could use 

$$\hat{\beta} = \left( \sum_{t=m+3}^{T} X_{t-2} X_{t-1}' \right)^{-1} \sum_{t=m+3}^{T} X_{t-2} \Delta y_t.$$

In this case, we are using $X_{t-2}$ as a set of instrumental variables for 
$X_{t-1}$, and one could experiment with larger vectors of instruments. 
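
A minimal sketch of this instrumental variables estimator follows. It assumes the just-identified form in which the $m$ lags in $X_{t-2}$ instrument the $m$ lags in $X_{t-1}$; the function names and the simulation helper are hypothetical and serve only to illustrate the calculation.

```python
import numpy as np

def iv_ar_coefficients(y, m):
    """IV estimate of the AR coefficients of the differenced series,
    using X_{t-2} as instruments for X_{t-1} (just-identified case)."""
    dy = np.diff(np.asarray(y, dtype=float))
    idx = np.arange(m + 1, dy.size)       # positions where all required lags exist
    # X_{t-1} = (dy_{t-1}, ..., dy_{t-m})', X_{t-2} = (dy_{t-2}, ..., dy_{t-m-1})'
    X1 = np.column_stack([dy[idx - j] for j in range(1, m + 1)])
    X2 = np.column_stack([dy[idx - j - 1] for j in range(1, m + 1)])
    return np.linalg.solve(X2.T @ X1, X2.T @ dy[idx])

def simulate_arima_m11(beta, theta, T, rng):
    """Hypothetical helper: levels y_t whose first difference x_t follows
    x_t = sum_j beta_j x_{t-j} + eps_t - theta * eps_{t-1}."""
    m = len(beta)
    eps = rng.standard_normal(T + m + 1)
    x = np.zeros(T + m + 1)
    for t in range(m + 1, T + m + 1):
        ar_part = sum(beta[j] * x[t - 1 - j] for j in range(m))
        x[t] = ar_part + eps[t] - theta * eps[t - 1]
    return np.cumsum(x[m + 1:])

rng = np.random.default_rng(0)
y = simulate_arima_m11([0.6], -0.4, 5000, rng)
beta_hat = iv_ar_coefficients(y, m=1)
```

The estimator breaks down when the instruments are uncorrelated with the regressors, for example when the autoregressive and moving average polynomials nearly cancel, which is one reason to experiment with larger instrument sets.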
Appendix: Proof of Theorem 6.1 


We first consider the consistency of $\tilde{\eta}_T$. To do this, we check that 
Assumptions 1.1 and 1.2 in Chapter 1 are satisfied. We need not 
check Assumption 1.3 because the weighting matrix $A_T$ specified 
for the general form of the GMM estimator is not required in the 
exactly identified case. We could work with the explicit formulae, 
but the proof is presented in this way for the purpose of examining 
these general assumptions in a simple example. 


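For Assumption 1.1, we have 

$$g(\eta) = E f(y_t, \eta) = \begin{pmatrix} E(y_t y_{t-1}) - \sigma^2\theta \\ E(y_t^2) - \sigma^2(1+\theta^2) \end{pmatrix} = \begin{pmatrix} \sigma_0^2\theta_0 - \sigma^2\theta \\ \sigma_0^2(1+\theta_0^2) - \sigma^2(1+\theta^2) \end{pmatrix}.$$

The parameter space for this estimation problem is $\Upsilon : [-1,1] \times 
(0,\infty)$. We need to show that $g(\eta) = 0$ is solved uniquely in $\Upsilon$ by 
$\eta = \eta_0$. Setting the first element of $g(\eta)$ to zero gives 

$$\sigma^2 = \frac{\sigma_0^2\theta_0}{\theta}.$$

Substituting this expression for $\sigma^2$ into the second element of $g(\eta)$ 
and setting to zero leads to the quadratic in $\theta$ 

$$\theta_0\theta^2 - (1+\theta_0^2)\theta + \theta_0 = 0.$$

If $\theta_0 = 0$ then it can be seen that the equation is solved by $\theta = 0$. 
If $\theta_0 \neq 0$ then the quadratic has the solutions $\theta = \theta_0$ and $\theta = 1/\theta_0$, 
of which only $\theta = \theta_0$ is in $\Upsilon$. With $\theta = \theta_0$ we obtain $\sigma^2 = \sigma_0^2$. Thus 
Assumption 1.1 is satisfied. 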


For Assumption 1.2, we require that $f_T(\eta)$ converges uniformly 
in probability to $g(\eta)$ given above. In this case, we need only 
verify that $T^{-1} \sum_{t=1}^{T} y_t^2 \to_p E(y_t^2)$ and $T^{-1} \sum_{t=2}^{T} y_t y_{t-1} \to_p E(y_t y_{t-1})$. 
These two results follow from applications of standard weak laws 
of large numbers for stationary time series, so Assumption 1.2 is 
satisfied. In this relatively simple case we do not need to resort to 
any more primitive assumptions in order to verify Assumption 1.2. 
In particular, the structure of the model is such that we do not 
need to restrict the parameter space to be compact. 
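Since these assumptions hold, we can apply Theorem 1.1 to 
conclude that $\tilde{\eta}_T$ is consistent. To show that the real-valued 
estimator $\hat{\eta}_T$ is consistent we show that $\hat{\eta}_T - \tilde{\eta}_T \to_p 0$. With respect to 
$\hat{\theta}_T$, we see that $\hat{\theta}_T = \tilde{\theta}_T$ if $|\hat{\rho}_1| < 0.5$, where $\hat{\rho}_1$ is the first order 
sample autocorrelation. Since $\Pr(|\hat{\rho}_1| < 0.5) \to 1$, 
we can conclude that $\hat{\theta}_T - \tilde{\theta}_T \to_p 0$. It is then immediate that 
$\hat{\sigma}_T^2 - \tilde{\sigma}_T^2 \to_p 0$, which shows that $\hat{\eta}_T$ is consistent. 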
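
The relationship between the moment equations and a real-valued estimator can be illustrated with a short sketch. It assumes the MA(1) moments $E(y_t^2) = \sigma^2(1+\theta^2)$ and $E(y_t y_{t-1}) = \sigma^2\theta$ used in this appendix; the boundary rule applied when the sample autocorrelation is at least 0.5 in absolute value (setting $\theta$ to $\pm 1$) is an assumption made for illustration, and the function name is hypothetical.

```python
import numpy as np

def ma1_method_of_moments(y):
    """Real-valued method of moments estimates (theta, sigma2) for a
    zero-mean MA(1), from the sample variance and first autocovariance."""
    y = np.asarray(y, dtype=float)
    T = y.size
    c0 = np.sum(y * y) / T                 # sample analogue of E(y_t^2)
    c1 = np.sum(y[1:] * y[:-1]) / T        # sample analogue of E(y_t y_{t-1})
    r1 = c1 / c0                           # first order sample autocorrelation
    if r1 == 0.0:
        theta = 0.0
    elif abs(r1) < 0.5:
        # Root of r1 * theta^2 - theta + r1 = 0 that lies inside [-1, 1].
        theta = (1.0 - np.sqrt(1.0 - 4.0 * r1 ** 2)) / (2.0 * r1)
    else:
        # Assumed boundary rule: the moment equations have no real root
        # inside the unit interval, so theta is set to the boundary.
        theta = float(np.sign(r1))
    sigma2 = c0 / (1.0 + theta ** 2)
    return theta, sigma2
```

When the true $\theta_0$ is inside the unit interval the boundary branch is used with probability tending to zero, which is the content of the consistency argument above.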


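In order to derive the asymptotic distribution of the estimator, 
we consider Assumptions 1.7-1.9 from Chapter 1. The matrix of 
first derivatives of $f(y_t, \eta)$ with respect to $\eta$ is 

$$\frac{\partial f(y_t, \eta)}{\partial \eta'} = -\begin{pmatrix} \sigma^2 & \theta \\ 2\sigma^2\theta & 1+\theta^2 \end{pmatrix} = F_T(\eta).$$

Each element of this matrix is continuous with respect to $\theta$ and $\sigma^2$ 
on $\Upsilon$, so Assumption 1.7 is satisfied. 


Hence for any $\eta_T^*$ such that $\eta_T^* \to_p \eta_0$ it follows that 

$$F_T(\eta_T^*) \to_p -\begin{pmatrix} \sigma_0^2 & \theta_0 \\ 2\sigma_0^2\theta_0 & 1+\theta_0^2 \end{pmatrix} = F.$$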

For Assumption 1.9, we require a central limit theorem for the 
vector 

$$\sqrt{T} f_T(\eta_0) = \begin{pmatrix} T^{-1/2} \sum_{t=2}^{T} (y_t y_{t-1} - \sigma_0^2\theta_0) \\ T^{-1/2} \sum_{t=1}^{T} (y_t^2 - \sigma_0^2(1+\theta_0^2)) \end{pmatrix}.$$

That is, we require a central limit theorem for the sample variance 
and first order autocovariance of a first order moving average 
process. This is a special case of well known results for the 
autocovariances of a linear process (see Anderson [1971], Theorem 8.4.2 
or Fuller [1996], Theorems 6.2.1 and 6.3.6 for example). We have 

$$\sqrt{T} f_T(\eta_0) \to_d N(0, V),$$


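where the form of $V$ can be deduced from the aforementioned 
theorems to be 

$$V = \lim_{T\to\infty} \mathrm{var}\, \sqrt{T} f_T(\eta_0) = \lim_{T\to\infty} T^{-1} \times \begin{pmatrix} \mathrm{var}\left(\sum_{t} y_t^2\right) & \mathrm{cov}\left(\sum_{t} y_t^2, \sum_{t} y_t y_{t-1}\right) \\ \mathrm{cov}\left(\sum_{t} y_t y_{t-1}, \sum_{t} y_t^2\right) & \mathrm{var}\left(\sum_{t} y_t y_{t-1}\right) \end{pmatrix}$$

$$= \begin{pmatrix} 2\sigma_0^4(1 + 4\theta_0^2 + \theta_0^4) + \kappa_4\sigma_*^4 & -4\sigma_0^4\theta_0\sigma_*^2 - \kappa_4\theta_0\sigma_*^2 \\ -4\sigma_0^4\theta_0\sigma_*^2 - \kappa_4\theta_0\sigma_*^2 & \sigma_0^4(1 + 5\theta_0^2 + \theta_0^4) + \kappa_4\theta_0^2 \end{pmatrix},$$

where $\sigma_*^2 = (1 + \theta_0^2)$. Then from Theorem 1.2 we have that 

$$\sqrt{T}(\tilde{\eta}_T - \eta_0) \to_d N(0, F^{-1} V F^{-1\prime}).$$

Note that the simplified asymptotic covariance matrix is obtained 
because in this case $\eta_0$ is exactly identified. We then find 

$$F^{-1} V F^{-1\prime} = \frac{1}{(1-\theta_0^2)^2} \times \begin{pmatrix} 1 + \theta_0^2 + 4\theta_0^4 + \theta_0^6 + \theta_0^8 & 2\sigma_0^2\theta_0^3(2 + \theta_0^2 + \theta_0^4) \\ 2\sigma_0^2\theta_0^3(2 + \theta_0^2 + \theta_0^4) & 2\sigma_0^4(1 - 2\theta_0^2 + 3\theta_0^4 + 2\theta_0^6) \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & \kappa_4 \end{pmatrix}.$$

This expression gives the asymptotic variance of both $\tilde{\eta}_T$ (which is 
asymptotically real with probability one) and $\hat{\eta}_T$. The additional 
term involving the fourth order cumulant $\kappa_4$ reflects the effect of 
any non-normality of $\varepsilon_t$ on the asymptotic variance of $\hat{\sigma}_T^2$. 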
Chapter 7 


REDUCED RANK REGRESSION USING GMM 


Frank Kleibergen 


Since the mid eighties, alongside the literature arising on GMM, 
a large number of papers emerged on cointegration as well. This 
is due to the fact that cointegration models combine two features 
which many economic time series possess, i.e., random walk in- 
dividual behavior and stationary linear combinations of multiple 
series. 


Cointegration models are essentially linear models with reduced 
rank parameters. The reduced forms of the traditional simultaneous 
equation models also have this reduced rank property (see 
Hausman [1983]). The estimation techniques used in cointegration 
and simultaneous equation models are therefore very similar. Maxi- 
mum likelihood estimators for both models use, for example, canon- 
ical correlations, (see Anderson and Rubin [1949] and Johansen 
[1991]), and maximum likelihood reduced rank regression therefore 
amounts to the use of canonical correlations and vectors. This 
chapter shows that GMM reduced rank regression amounts to the 
use of two stage least squares (2SLS) estimators. The asymptotic 
properties of the 2SLS estimators used in simultaneous equation 
models are in general identical to the properties of maximum like- 
lihood estimators (see, for example, Phillips [1983]). This chapter 
shows that this also holds for cointegration models. Furthermore, 
the GMM objective function has asymptotic properties which are 
identical to a likelihood ratio statistic for cointegration, the Jo- 
hansen trace statistic (Johansen [1991]), and it can thus be used 
in a similar way. The similarities between GMM and maximum 
likelihood estimators in reduced rank models are therefore quite 
large. The GMM, however, also allows for the derivation of the 
asymptotic properties in more complex reduced rank models, 
which is not true for the maximum likelihood estimators. We 
show this for cointegration models with structural breaks in the 
variance and cointegrating vectors. In practice, there is a need 
for cointegration estimators and test statistics that can be applied 
to these kinds of models, as a large number of economic time series 
have these properties (like heteroskedasticity and structural breaks). 
The main aim of this chapter is to introduce GMM estimators for 
cointegration models incorporating heteroskedasticity and/or 
structural breaks. 


The chapter is organized as follows. In Section 7.1 the relation 
between the GMM-2SLS estimators in cointegration and simulta- 
neous equations models is discussed jointly with the limiting distri- 
butions of the cointegrating vector estimators for a few widely used 
specifications. Section 7.2 shows the limiting distribution of the 
GMM objective function, which can be used to test for the number 
of unit roots/cointegrating relationships. Section 7.3 extends the 
stylized model to a model where a shift of variance occurs after 
a predefined time period. A Generalized Least Squares approach, 
which assumes a priori knowledge of the variance shift moment, is 
used to construct the GMM cointegration estimators and statistics 
that allow for heteroskedasticity. Section 7.4 discusses GMM coin- 
tegration estimators and statistics that allow for a change in the 
cointegrating relationship and/or multiplicator. Both extensions 
can be further generalized to more shifts and, also, more moment 
conditions can be added. Finally, Section 7.5 concludes. 


Note that the following definitions are used throughout this 
chapter: $\Rightarrow$ indicates weak convergence; integrals are taken over 
the unit interval unless indicated otherwise; when possible without 
confusion, integrals like $\int W(t)\,dt$ are abbreviated as $\int W$. The 
theorems are derived assuming Gaussian disturbances, which can 
be relaxed (see, for example, Phillips and Solo [1992]). 
7.1 GMM-2SLS Estimators in Reduced Rank Models 


7.1.1 Reduced Rank Regression Models 


Reduced rank regression models are characterized by the lower 
column or row rank of a parameter matrix. Two well known 
models with this property are the Error Correction Cointegration 
Model (ECCM) and the INcomplete Simultaneous Equations Model 
(INSEM). The ECCM is specified as (see Engle and Granger [1987] 
and Johansen [1991]) 

$$\Delta x_t = \alpha\beta' x_{t-1} + \varepsilon_t, \tag{7-1}$$

where $x_t : k \times 1$, $t = 1,\ldots,T$; $\alpha, \beta : k \times r$; $\beta' = (I_r \;\; -\beta_2')$; and $\varepsilon_t$ 
is Gaussian white noise with covariance matrix $\Sigma$. For simplicity 
higher order lags are neglected. The INSEM reads (see Chapter 1 
and Hausman [1983]) 

$$\begin{aligned} y_{1t} &= \beta' y_{2t} + \gamma' x_{1t} + \varepsilon_{1t}, \\ y_{2t} &= \Pi_{21} x_{1t} + \Pi_{22} x_{2t} + \varepsilon_{2t}, \end{aligned} \tag{7-2}$$

where $y_{1t} : m_1 \times 1$, $y_{2t} : m_2 \times 1$, $x_{1t} : k_1 \times 1$, $x_{2t} : k_2 \times 1$, $t = 1,\ldots,T$; 
$\beta : m_2 \times m_1$; $\gamma : k_1 \times m_1$; $\Pi_{21} : m_2 \times k_1$; $\Pi_{22} : m_2 \times k_2$. The 
disturbances $\varepsilon_{1t}$ and $\varepsilon_{2t}$ are assumed to be Gaussian white noise 
with covariance matrix $\Omega$. The variables $x_{1t}$ and $x_{2t}$ are assumed 
to be (weakly) exogenous. The INSEM (7-2) is identified when 
the number of excluded exogenous variables from the first set of 
equations, $k_2$, is at least as large as the number of equations in the 
second set, $m_2$, i.e., $k_2 \ge m_2$. 


Time series generated by ECCMs exhibit random walk patterns 
for individual series while the joint behavior of the series 
has stationary linear combinations (see Engle and Granger [1987]). 
In the case of ECCM (7-1) these stationary linear combinations 
are represented by $\beta' x_t$, which is why $\beta$ is called the cointegrating 
vector. These properties are observed in many economic time series, 
which explains the popularity of cointegration in applied work. Like 
traditional simultaneous equation models, INSEMs explain series 
which are generated simultaneously. These types of economic time 
series have been analysed since the late forties, see, for example, 
Anderson and Rubin [1949]. 


The reduced rank property of both of these models is obtained 
when they are specified as restrictions of the standard linear model, 

$$z_t = \Pi w_t + u_t. \tag{7-3}$$

Both the ECCM and the INSEM are restricted versions of 
(7-3). The ECCM is obtained when $z_t = \Delta x_t$, 
$\Pi = \alpha\beta' = \begin{pmatrix} \alpha_{11} & -\alpha_{11}\beta_2' \\ \alpha_{21} & -\alpha_{21}\beta_2' \end{pmatrix}$, 
$w_t = x_{t-1}$, $u_t = \varepsilon_t$, while the INSEM is obtained 
when we substitute the equation of $y_{2t}$ in the equation of $y_{1t}$, 
which then results in (7-3) with 
$z_t = \begin{pmatrix} y_{1t} \\ y_{2t} \end{pmatrix}$, 
$w_t = \begin{pmatrix} x_{1t} \\ x_{2t} \end{pmatrix}$, 
$u_t = \begin{pmatrix} \varepsilon_{1t} + \beta'\varepsilon_{2t} \\ \varepsilon_{2t} \end{pmatrix}$, 
$\Pi = \begin{pmatrix} \beta'\Pi_{21} + \gamma' & \beta'\Pi_{22} \\ \Pi_{21} & \Pi_{22} \end{pmatrix}$. 
The reduced rank structure of the ECCM is obvious while the INSEM has a reduced 
rank structure when $\gamma = 0$ since the first set of rows of $\Pi$ is a 
linear function of the other rows in that case. The reduced rank 
properties of both models are different, however, as in the ECCM 
the last set of columns is a linear combination of the first set, while 
in the INSEM the first set of rows is a linear combination of the last 
set. This also implies that the GMM estimators of the parameters 
of these two models are different. 


7.1.2 GMM-2SLS Estimators 


The reduced rank property of the INSEM and ECCM implies that 
the parameters of these models cannot be estimated using the 
method of moments. This results from the fact that the moment 
condition of model (7-3), 

$$f_T(\Pi) = \sum_{t=1}^{T} u_t w_t' = 0, \tag{7-4}$$

does not lead to a unique estimator of the parameters as the number 
of parameters in $\Pi$, $r(2k - r)$ in case of the ECCM and $(m-1)(k+1) + k_1$ 
in case of the INSEM, is less than the number of equations in 
(7-4), $k^2$ (see, also, Chapter 1). The GMM is designed to estimate 
the parameters of a model overidentified by the moment conditions. 
Instead of the moment conditions (7-4), the GMM minimizes a 
quadratic form of it with respect to a weighting matrix $A_T$ 

$$\begin{aligned} Q_T(\theta) &= \mathrm{vec}(f_T(\Pi(\theta)))' A_T \, \mathrm{vec}(f_T(\Pi(\theta))) \\ &= \mathrm{vec}\Big(\sum_{t=1}^{T} u_t w_t'\Big)' \Big(\Big(\sum_{t=1}^{T} w_t w_t'\Big)^{-1} \otimes \Sigma^{-1}\Big) \mathrm{vec}\Big(\sum_{t=1}^{T} u_t w_t'\Big) \\ &= \mathrm{tr}\Big(\Sigma^{-1} \Big(\sum_{t=1}^{T} u_t w_t'\Big) \Big(\sum_{t=1}^{T} w_t w_t'\Big)^{-1} \Big(\sum_{t=1}^{T} u_t w_t'\Big)'\Big), \end{aligned} \tag{7-5}$$

where $\theta = (\alpha, \beta_2)$ in the case of the ECCM and $\theta = (\beta, \gamma, \Pi_{21}, \Pi_{22})$ 
in the case of the INSEM. We choose as weighting matrix 
$A_T = \big(\big(\sum_{t=1}^{T} w_t w_t'\big)^{-1} \otimes \Sigma^{-1}\big)$, as this enables us to use $Q_T(\theta)$ as a test 
statistic as well to test hypotheses on $\theta$. 
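
The equality of the Kronecker and trace forms of the objective function can be checked numerically. The sketch below assumes the data are stored row-wise (row $t$ of `U` holds $u_t'$, row $t$ of `W` holds $w_t'$) and that vec stacks columns; the function names are hypothetical.

```python
import numpy as np

def gmm_objective_vec(U, W, Sigma):
    """Quadratic form vec(f)' A_T vec(f) with f = sum_t u_t w_t' and
    A_T = (sum_t w_t w_t')^{-1} kron Sigma^{-1}."""
    f = U.T @ W                                   # sum_t u_t w_t'
    A = np.kron(np.linalg.inv(W.T @ W), np.linalg.inv(Sigma))
    v = f.reshape(-1, order="F")                  # column-stacking vec operator
    return float(v @ A @ v)

def gmm_objective_trace(U, W, Sigma):
    """Equivalent trace form tr(Sigma^{-1} f (sum_t w_t w_t')^{-1} f')."""
    f = U.T @ W
    inner = f @ np.linalg.inv(W.T @ W) @ f.T
    return float(np.trace(np.linalg.solve(Sigma, inner)))
```

The trace form avoids building the Kronecker product, which matters when the dimensions of $u_t$ and $w_t$ grow.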


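To construct the first order conditions for a minimum of $Q_T(\theta)$, 
the first order derivatives of $\Pi$ with respect to 
the different elements of $\theta$ are needed. These expressions read, for 
the ECCM, 

$$\frac{\partial \mathrm{vec}(\Pi)}{\partial \mathrm{vec}(\beta')'} = (I_k \otimes \alpha), \qquad \frac{\partial \mathrm{vec}(\Pi)}{\partial \mathrm{vec}(\alpha)'} = (\beta \otimes I_k), \tag{7-6}$$

and for the INSEM, with $\Pi_2 = (\Pi_{21} \;\; \Pi_{22})$, 

$$\frac{\partial \mathrm{vec}(\Pi)}{\partial \mathrm{vec}(\beta')'} = \Big((\Pi_{21} \;\; \Pi_{22})' \otimes \begin{pmatrix} I_{m_1} \\ 0 \end{pmatrix}\Big), \quad \frac{\partial \mathrm{vec}(\Pi)}{\partial \mathrm{vec}(\gamma')'} = \Big((I_{k_1} \;\; 0)' \otimes \begin{pmatrix} I_{m_1} \\ 0 \end{pmatrix}\Big), \quad \frac{\partial \mathrm{vec}(\Pi)}{\partial \mathrm{vec}(\Pi_2)'} = \Big(I_k \otimes \begin{pmatrix} \beta' \\ I_{m_2} \end{pmatrix}\Big). \tag{7-7}$$

These derivatives are substituted in the first order condition for a 
minimum of $Q_T(\theta)$, 

$$\begin{aligned} \frac{\partial Q_T(\theta)}{\partial \theta'} = 0 \;&\Leftrightarrow\; \Big(\frac{\partial \mathrm{vec}\big(\sum_{t=1}^{T} u_t w_t'\big)}{\partial \theta'}\Big)' \Big(\Big(\sum_{t=1}^{T} w_t w_t'\Big)^{-1} \otimes \Sigma^{-1}\Big) \mathrm{vec}\Big(\sum_{t=1}^{T} u_t w_t'\Big) = 0 \\ &\Leftrightarrow\; \Big(\frac{\partial \mathrm{vec}(\Pi)}{\partial \theta'}\Big)' \mathrm{vec}\Big(\Sigma^{-1} \sum_{t=1}^{T} u_t w_t'\Big) = 0. \end{aligned} \tag{7-8}$$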

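For the parameters of the ECCM these first order conditions then 
read, 

$$(I_k \otimes \alpha)' \mathrm{vec}\Big(\Sigma^{-1} \sum_{t=1}^{T} u_t x_{t-1}'\Big) = 0 \;\Leftrightarrow\; \Big(\sum_{t=1}^{T} x_{t-1} x_{t-1}'\Big)^{-1} \Big(\sum_{t=1}^{T} x_{t-1} \Delta x_t'\Big) \Sigma^{-1}\alpha (\alpha'\Sigma^{-1}\alpha)^{-1} = \hat{\beta}, \tag{7-9}$$

$$(\beta \otimes I_k)' \mathrm{vec}\Big(\Sigma^{-1} \sum_{t=1}^{T} u_t x_{t-1}'\Big) = 0 \;\Leftrightarrow\; \Big(\sum_{t=1}^{T} \Delta x_t x_{t-1}' \beta\Big) \Big(\beta' \sum_{t=1}^{T} x_{t-1} x_{t-1}' \beta\Big)^{-1} = \hat{\alpha}, \tag{7-10}$$

while for the parameters of the INSEM these first order conditions 
read, with $x_t = (x_{1t}' \;\; x_{2t}')'$, 

$$\Big((\Pi_{21} \;\; \Pi_{22})' \otimes \begin{pmatrix} I_{m_1} \\ 0 \end{pmatrix}\Big)' \mathrm{vec}\Big(\Sigma^{-1} \sum_{t=1}^{T} u_t x_t'\Big) = 0 \;\Leftrightarrow\; \Big((\Pi_{21} \;\; \Pi_{22}) \sum_{t=1}^{T} x_t x_t' (\Pi_{21} \;\; \Pi_{22})'\Big)^{-1} (\Pi_{21} \;\; \Pi_{22}) \sum_{t=1}^{T} x_t (y_{1t} - \gamma' x_{1t})' = \hat{\beta}, \tag{7-11}$$

$$\Big((I_{k_1} \;\; 0)' \otimes \begin{pmatrix} I_{m_1} \\ 0 \end{pmatrix}\Big)' \mathrm{vec}\Big(\Sigma^{-1} \sum_{t=1}^{T} u_t x_t'\Big) = 0 \;\Leftrightarrow\; \Big(\sum_{t=1}^{T} x_{1t} x_{1t}'\Big)^{-1} \sum_{t=1}^{T} x_{1t} (y_{1t} - \beta' y_{2t})' = \hat{\gamma}, \tag{7-12}$$

$$\Big(I_k \otimes \begin{pmatrix} \beta' \\ I_{m_2} \end{pmatrix}\Big)' \mathrm{vec}\Big(\Sigma^{-1} \sum_{t=1}^{T} u_t x_t'\Big) = 0 \;\Leftrightarrow\; \sum_{t=1}^{T} y_{2t} x_t' \Big(\sum_{t=1}^{T} x_t x_t'\Big)^{-1} = \hat{\Pi}_2. \tag{7-13}$$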


The first order conditions for the INSEM (7-11)-(7-13) lead to the 
well known 2SLS estimator as (7-13) shows that $\Pi_2$ can be 
estimated independently from $\beta$ and $\gamma$. The resulting least squares 
(GMM) estimator of $\Pi_2$ is then used to construct estimators for $\beta$ 
and $\gamma$ (GMM-2SLS estimators, see also Chapter 1). 
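
The two stages described above can be sketched as follows. This is an illustration for a single structural equation ($m_1 = 1$), with the observations stored row-wise; the function name and argument layout are hypothetical.

```python
import numpy as np

def two_stage_least_squares(y1, y2, x1, x2):
    """2SLS for y1 = y2 @ beta + x1 @ gamma + e1.

    y1: (T,), y2: (T, m2), x1: (T, k1), x2: (T, k2) with k2 >= m2.
    Returns (beta_hat, gamma_hat).
    """
    x = np.column_stack([x1, x2])
    # First stage: least squares of y2 on all exogenous variables.
    pi2_t, *_ = np.linalg.lstsq(x, y2, rcond=None)
    y2_fitted = x @ pi2_t
    # Second stage: regress y1 on the fitted y2 and the included exogenous x1.
    regressors = np.column_stack([y2_fitted, x1])
    coef, *_ = np.linalg.lstsq(regressors, y1, rcond=None)
    m2 = y2_fitted.shape[1]
    return coef[:m2], coef[m2:]
```

Because the first stage does not involve the structural parameters, it can be computed once and reused, which is the sense in which the second set of equations is estimated independently.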


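The first order conditions for the ECCM (7-9)-(7-10) show that 
the GMM estimators of $\alpha$ and $\beta$ in the ECCM both depend on each 
other. As we did not restrict $\alpha$ and $\beta$, they are not identified. If we 
specify $\beta$ as $\beta = (I_r \;\; -\beta_2')'$, both $\alpha$ and $\beta_2$ are properly identified. 
When this specification of $\beta$ is used, a consistent GMM estimator 
of $\alpha$ is, 

$$\hat{\alpha} = \Big(\sum_{t=1}^{T} \Delta x_t \Big(1 - x_{2t-1}' \Big(\sum_{t=1}^{T} x_{2t-1} x_{2t-1}'\Big)^{-1} x_{2t-1}\Big) x_{1t-1}'\Big) \Big(\sum_{t=1}^{T} x_{1t-1} \Big(1 - x_{2t-1}' \Big(\sum_{t=1}^{T} x_{2t-1} x_{2t-1}'\Big)^{-1} x_{2t-1}\Big) x_{1t-1}'\Big)^{-1}, \tag{7-14}$$

where $x_t = \begin{pmatrix} x_{1t} \\ x_{2t} \end{pmatrix}$, $x_{1t} : r \times 1$, $x_{2t} : (k - r) \times 1$. This estimator 
consists of the first $r$ columns of the least squares estimator of $\Pi$ 
in (7-3). 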


If $\hat{\alpha}$ is substituted in the first order conditions of the 
cointegrating vector $\beta$, (7-9), the resulting cointegrating vector estimator 
$\hat{\beta}$ automatically satisfies the identifying restrictions on $\beta$. $\hat{\beta}$ is 
then the GMM-2SLS estimator of the cointegrating vector $\beta$. In 
a Bayesian analysis this GMM-2SLS estimator equals the mean 
of the conditional posterior of $\beta$ given $\alpha$, when a diffuse prior 
is used (see Kleibergen and van Dijk [1994a]). The estimators 
$\hat{\alpha}$ and $\hat{\beta}$ in (7-9) and (7-10) also allow for the construction of 
an iterative estimation scheme which converges to the maximum 
likelihood estimators. Asymptotically the GMM-2SLS cointegrating 
vector estimator possesses the same kind of properties as the 
maximum likelihood estimator, i.e., superconsistency and asymptotic 
normality. This is shown in the theorems of the following 
sections. Note that the asymptotic normality of this GMM-2SLS 
estimator does not result from Section 1.3.2 of Chapter 1 as the 
data are nonstationary and Assumption 1.9 is violated. 
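
The two-step construction can be sketched numerically. The code below is an illustration under stated assumptions: $\hat{\alpha}$ is taken as the first $r$ columns of the unrestricted least squares estimate of $\Pi$, $\Sigma$ is estimated from the unrestricted residuals, and $\hat{\beta}$ is computed from the first order condition (7-9) with $\hat{\alpha}$ and $\hat{\Sigma}$ plugged in. The function names and the simulation helper are hypothetical.

```python
import numpy as np

def simulate_eccm(alpha, beta, T, rng):
    """Hypothetical helper: simulate dx_t = alpha beta' x_{t-1} + eps_t
    with standard normal eps_t, starting from x_0 = 0."""
    k = alpha.shape[0]
    Pi = alpha @ beta.T
    x = np.zeros((T + 1, k))
    for t in range(1, T + 1):
        x[t] = x[t - 1] + Pi @ x[t - 1] + rng.standard_normal(k)
    return x

def gmm_2sls_cointegration(x, r):
    """Return (alpha_hat, beta_hat, Sigma_hat) for a k-variate series x."""
    x = np.asarray(x, dtype=float)
    dx, lag = x[1:] - x[:-1], x[:-1]       # rows hold dx_t' and x_{t-1}'
    T, k = dx.shape
    Sxx = lag.T @ lag                      # sum_t x_{t-1} x_{t-1}'
    Syx = dx.T @ lag                       # sum_t dx_t x_{t-1}'
    Pi_hat = Syx @ np.linalg.inv(Sxx)      # unrestricted least squares
    alpha_hat = Pi_hat[:, :r]              # first r columns
    resid = dx - lag @ Pi_hat.T
    Sigma_hat = resid.T @ resid / (T - k)
    Sinv_alpha = np.linalg.solve(Sigma_hat, alpha_hat)
    # beta_hat = Sxx^{-1} (sum x dx') Sigma^{-1} alpha (alpha' Sigma^{-1} alpha)^{-1}
    beta_hat = (np.linalg.solve(Sxx, Syx.T) @ Sinv_alpha
                @ np.linalg.inv(alpha_hat.T @ Sinv_alpha))
    return alpha_hat, beta_hat, Sigma_hat
```

With this choice of $\hat{\alpha}$ the first $r$ rows of the computed $\hat{\beta}$ reproduce the identity matrix up to rounding, which is the sense in which the identifying restrictions are satisfied automatically.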


7.1.3 Limiting Distributions for the 
GMM-2SLS Cointegration Estimators 


As an extensive literature exists on the limiting distribution of the 
2SLS estimator in the INSEM, see, for example, Phillips [1983], 
we focus on the limiting distribution of the GMM-2SLS estimators 
for the ECCM, which is only sparsely discussed in the literature 
(see, for example, Quintos [1994], where the case that $w_t$ in (7-5) 
is uncorrelated with $x_{t-1}$ is discussed). Theorem 7.1 states the 
limiting distribution of the GMM multiplicator estimator, $\hat{\alpha}$, and 
the GMM-2SLS cointegrating vector estimator, $\hat{\beta}$. 


THEOREM 7.1 


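When the Data Generating Process (DGP) in (7-1) is such 
that the number of cointegrating vectors equals $r$ ($k - r$ unit 
roots) and $\alpha_\perp'\beta_\perp$ is nonsingular, the GMM estimators 

$$\hat{\alpha} = \Big(\sum_{t=1}^{T} \Delta x_t \Big(1 - x_{2t-1}' \Big(\sum_{t=1}^{T} x_{2t-1} x_{2t-1}'\Big)^{-1} x_{2t-1}\Big) x_{1t-1}'\Big) \Big(\sum_{t=1}^{T} x_{1t-1} \Big(1 - x_{2t-1}' \Big(\sum_{t=1}^{T} x_{2t-1} x_{2t-1}'\Big)^{-1} x_{2t-1}\Big) x_{1t-1}'\Big)^{-1} \tag{7-15}$$

and 

$$\hat{\beta} = \Big(\sum_{t=1}^{T} x_{t-1} x_{t-1}'\Big)^{-1} \Big(\sum_{t=1}^{T} x_{t-1} \Delta x_t'\Big) \hat{\Sigma}^{-1} \hat{\alpha} (\hat{\alpha}' \hat{\Sigma}^{-1} \hat{\alpha})^{-1}, \tag{7-16}$$

which use the same value of $r$, have a limiting behavior 
characterized by 

$$\sqrt{T}(\hat{\alpha} - \alpha) \Rightarrow N(0, \mathrm{cov}(\beta' x)^{-1} \otimes \Sigma), \tag{7-17}$$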
3 0 
TĒ- > (agga AY mw Wawa 


0 
= (Was wes) i 
(7-18) 
where $W_1$, resp. $W_2$, are $(k - r)$, resp. $r$ dimensional 
stochastically independent Brownian motions defined on 
the unit interval, Ay = (a Za), Az = (a’E7!a)?, 


O = (8,81) Basar" f WaWy) Ase Bs (B84) 


and X is estimated by the sum of squared residuals, S= 


T T 
p D(An(l -sa(X tit) tA). 


For a proof of the theorem see the Appendix. 


The limiting distributions of the GMM estimators in Theorem 
7.1 do not result from Hansen [1982] as the data are generated 
by a nonstationary process and Assumption 1.9 from Chapter 1 is 
therefore violated. Asymptotic normality of the GMM cointegrating 
vector estimator $\hat{\beta}$ is obtained as it consists of the projection 
of a standard least squares estimator of $\Pi$ on a space spanned 
by $\hat{\alpha}$. Johansen's Representation theorem (see Johansen [1991]) 
shows that the random walk components of the DGP resulting 
from an ECCM lie in a space orthogonal to $\alpha$. The projection on 
$\hat{\alpha}$, which is a consistent estimator of $\alpha$, therefore implies that $\hat{\beta}$ 
is asymptotically uncorrelated with the random walk components 
of the DGP. The limiting distribution of the GMM estimator $\hat{\beta}$ 
therefore has the same properties as it would have in the case of a 
stationary DGP, i.e., asymptotic normality, instead of consisting of 
Brownian motion functionals as in the case of most random walk DGPs 
(see Phillips [1991]). 


Theorem 7.1 discusses the limiting distribution of the cointegrating 
vector estimator for the most straightforward case, i.e., no 
further lags in the VAR polynomial and no deterministic components, 
and shows that it is identical to the limiting distribution 
of the canonical correlation maximum likelihood estimator of 
Johansen (see Johansen [1991]). While adding lags of $\Delta x_t$ only changes 
the limiting distribution of the cointegrating vector estimator, $\hat{\beta}$, 
in the sense that $\alpha_\perp'\beta_\perp$ has to be replaced by $\alpha_\perp'\Gamma(1)\beta_\perp$, where 
$\Gamma(L)\Delta x_t = \alpha\beta' x_{t-1} + \varepsilon_t$ and $\Gamma(L)$ is a $(p - 1)$ dimensional lag 
polynomial in the case of a VAR(p), inclusion of deterministic 
components also changes the functional form of the cointegrating vector 
estimator (see, for example, Johansen [1991] and Kleibergen and 
van Dijk [1994b]). Theorem 7.2 states the GMM estimators of the 
multiplicator and cointegrating vector, and their limiting distributions, 
for a few commonly used specifications of the deterministic 
components. 


THEOREM 7.2 


When the DGP reads 

$$\Delta x_t = \alpha(\beta' x_{t-1} + \mu') + \varepsilon_t, \tag{7-19}$$

and the number of cointegrating vectors equals $r$ ($k - r$ 
unit roots), $\alpha_\perp'\beta_\perp$ nonsingular, the GMM estimators 


â= È Az;(1 — Ea ) 3 Ge a yr 
‘ee ate Ty-1(1 — ea ) oF Ga cy) 


t=1 


(7-21) 
és ) Act) a(a’'E—@)-! 


have a limiting behavior characterized by 


VJT(@— a) > N(0, cov(6’x — H’) 8 E), (7-22) 
(7 a) (Ron) > 
( (E ee 9 Wf ae awa; = 


2 where 
N(0,a' 5ta 8 01) }’ 


SEE) o 
When the DGP reads 

$$\Delta x_t = c + \alpha\beta' x_{t-1} + \varepsilon_t, \tag{7-24}$$

$c = \alpha\mu' + \alpha_\perp\lambda'$, and the number of cointegrating vectors 
equals $r$ ($k - r$ unit roots), $\alpha_\perp'\beta_\perp$ nonsingular, the GMM 
estimators 


a= Yanl- Ge Ole pees 


t=1 


Aa ae A E (25) 


t=1 t=1 


COM ea 
and 
ENE 


5 (7-26) 
2 Ta Azala) 
have a limiting behavior characterized by 
VT(&— a) > N(0, cov(Z'x — p’) 8 5), (7-27) 
( (yaaa o (2-2) 
0 T? #=H 


A 0 -17 Wi Wi í Wu 
=> ( t) (f T T EJ rT |dW:M 
t l l 
=> N(0, Eta 8 02). 
(7-28) 
When the DGP reads 

$$\Delta x_t = c + \alpha(\beta' x_{t-1} + \delta' t) + \varepsilon_t, \tag{7-29}$$

$c = \alpha\mu' + \alpha_\perp\lambda'$, and the number of cointegrating vectors 
equals $r$ ($k - r$ unit roots), $\alpha_\perp'\beta_\perp$ nonsingular, the GMM 
estimators 


T- Tt—1 ; T Lot-1 Xot-1 d 
â@=(¢_ Az (-| 1 |© 1 1 =o 
t=1 t t=1 t t 


T2t-1 j ; Tt—1 ; 
1 Jy ee oe Peir) 
t t 
(7-30) 
and 
B T (%1\ [%-1\' 
a) =| TET p 
aa ee : (7-31) 
T Tt—1 
œ| 1 Jaspera a 
t=1 t 
have a limiting behavior characterized by 
VT(â@- a) > N(0, cov(f'x — p’ — 8't)! 8 E), (7-32) 
Tk 0 0 B-B 
0 T? 0 ||ĝ-u]|> 
0 0 T?/ \6-6 
2 $ (7-33) 
W. = 
1 -1 a -1 1 
G nwa dW4) As, 
2 
= 


0 
zi ie a’=-ta O e i 


where W,, Wi, and W2 are (k — r), (k — r — 1) andr 
dimensional stochastically independent Brownian motions, 
the value of r used for the GMM estimators equals the value 
of r for the involved DGP, 


Ag a (Mkaa) oat) Gi ae) E 


, 


I, 
0<t<1, B= ( So) 
Go. = (6.01) PaL (laa) ee 
O: = (8,61) “Ba A YW +A 1a", ba (LBL, 


6 EEIE I 


©; = (6,61) B a A Wa A a babby 


PROOF OF THE THEOREM 


The first and third part of the theorem are natural extensions 
of Theorem 7.1. The second part of the theorem is 
proved in the Appendix. 

∎ 


Theorems 7.1 and 7.2 show the asymptotic normality of the 
GMM cointegrating vector estimator and standard (asymptotic) 
$\chi^2$ tests can therefore be performed to test hypotheses on the 
cointegrating vectors (see Phillips [1991]). This extends the asymptotic 
normality of the GMM estimator discussed in Chapter 1 to an 
important class of models. The GMM not only allows for testing 
hypotheses on the cointegrating vectors or multiplicators but also 
on the number of cointegrating vectors. As discussed in the next 
section, the GMM objective function (7-5) can be used for this 
purpose. 
7.2 Testing Cointegration Using GMM-2SLS Estimators 


The GMM objective function (7-5) not only allows us to construct 
the GMM estimators but can also be used to test for the number of 
cointegrating vectors or unit roots (see also Chapter 6). This is possible 
because the optimal value of the objective function has a specific limiting 
distribution under $H_0 : r = r^*$. In Theorem 7.3, the analytical form 
of this objective function is presented for several specifications of 
the deterministic components, together with their limiting distributions. 


THEOREM 7.3 


When (7-1) is the DGP and the number of cointegrating 
vectors equals $r$ ($(k - r)$ unit roots), substitution of the 
GMM estimators (7-15) and (7-16), using the same value 
of $r$, in the GMM objective function (7-5) leads to a limiting 
behavior characterized by 

$$Q_T(\hat{\alpha}, \hat{\beta}) \Rightarrow \mathrm{tr}\Big[\Big(\int W_1 \, dW_1'\Big)' \Big(\int W_1 W_1'\Big)^{-1} \int W_1 \, dW_1'\Big]. \tag{7-34}$$


When (7-19) is the DGP and the number of cointegrating 
vectors equals $r$ ($(k - r)$ unit roots), substitution of the 
GMM estimators (7-20) and (7-21), using the same value 
of $r$, in the GMM objective function (7-5) leads to a 
limiting behavior characterized by 


Qr(a@, B, n) 


(7-35) 
When (7-24) is the DGP and the number of cointegrating 
vectors equals r ((k — r) unit roots), substitution of the 
GMM estimators (7-25) and (7-26), using the same value 
of r in the GMM objective function (7—5) leads to a limiting 
behavior characterized by 


aB > eif (P) amy 


JEA Yr ax 


(7-36) 
When (7-29) is the DGP and the number of cointegrating 
vectors equals $r$ ($(k - r)$ unit roots), substitution of 
the GMM estimators (7-30) and (7-31), using the same 
value of $r$, in the GMM objective function (7-5) leads to a 
limiting behavior characterized by 


Qr(a, B, ô, C) 


sag (yarog (E (8) aw 


(7-37) 


where x; = (z) Zu iT Xl, ty: (k-r)x l; tı = 
2t 
T 
£ii — ED t, t= ta Et W,, Wi are (k — r), 


(k —r—1) dimensional Brownian motions, W; = i ) , 
12 


W, = Wı — JW, Wu = Wi — J Wn, T(t) =t, u(t) =1, 
0<t<1,7=7-f7, and © is estimated by the residual 
sum of squares for the unrestricted model. 


PROOF OF THE THEOREM 


The proof of the first part is given in the Appendix; the other 
parts follow straightforwardly. 

∎ 


The limiting distributions of the GMM objective function (7-5) 
stated in Theorem 7.3 can be used to test for the number of 
cointegrating vectors. By simulating from these limiting distributions, 
the asymptotic critical values can be determined. These values 
are tabulated, for example, in Hamilton [1994]. $Q_T$ can then be 
calculated for different values of the number of cointegrating vectors 
$r$, $r = 1, \ldots, k$, such that the values of $r$ can be determined for which 
the associated $Q_T$'s exceed their asymptotic critical values. These 
values of $r$ are then rejected as plausible values of the number 
of cointegrating vectors. 
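
The simulation of asymptotic critical values can be sketched as follows. The code assumes the case without deterministic components and a limit of the familiar form $\mathrm{tr}[(\int W\,dW')'(\int WW')^{-1}\int W\,dW']$, with the Brownian motion approximated by a scaled Gaussian random walk; the function name, the number of steps and the number of replications are hypothetical choices.

```python
import numpy as np

def simulate_trace_limit(n_unit_roots, n_steps=400, n_reps=2000, seed=0):
    """Draws from tr[(int W dW')' (int W W')^{-1} (int W dW')], where W is
    an n_unit_roots-dimensional standard Brownian motion on [0, 1],
    approximated on a grid of n_steps points."""
    rng = np.random.default_rng(seed)
    draws = np.empty(n_reps)
    for i in range(n_reps):
        dW = rng.standard_normal((n_steps, n_unit_roots)) / np.sqrt(n_steps)
        W = np.cumsum(dW, axis=0) - dW     # W at the left end of each step
        A = W.T @ dW                       # approximates int W dW'
        B = W.T @ W / n_steps              # approximates int W W'
        draws[i] = np.trace(A.T @ np.linalg.solve(B, A))
    return draws

# Approximate 95% critical value for one unit root (k - r = 1).
critical_value_95 = np.quantile(simulate_trace_limit(1), 0.95)
```

Finer grids and more replications reduce the approximation and Monte Carlo error; the specifications with deterministic components would require the corresponding de-meaned or de-trended processes instead of $W$.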


Theorems 7.1 to 7.3 show that the limiting distributions using 
the GMM-2SLS estimators are identical to the limiting distributions 
when maximum likelihood estimators are used (see Johansen 
[1991]). As maximum likelihood estimators can be constructed 
in a straightforward way using canonical correlations, they are 
not computationally more difficult than GMM-2SLS estimators. 
Differences between these two estimators can arise though in their 
small sample distributions and in model extensions as maximum 
likelihood estimators become analytically intractable when more 
complicated models are analyzed. 


In Phillips [1994], it is shown that the canonical correlation 
cointegrating vector estimator has a small sample distribution with 
Cauchy type tails such that it has no finite moments. When we 
neglect the dynamic property of the data and assume fixed regressors, 
results from Phillips [1983] indicate that the small sample 
distribution of the GMM-2SLS cointegrating vector estimator has 
finite moments up to the degree $(k - r)$. This degree is determined 
by the $\hat{\Sigma}^{-1}\hat{\alpha}(\hat{\alpha}'\hat{\Sigma}^{-1}\hat{\alpha})^{-1}$ expression appearing in the cointegrating 
vector estimator $\hat{\beta}$ (see also Chapter 5). As $\beta$ is specified such that 
it always has rank $r$, rank reduction of $\alpha\beta'$ implies that $\alpha$ has a lower 
rank value. In that case $\hat{\alpha}'\hat{\Sigma}^{-1}\hat{\alpha}$ would not be invertible, leading to 
the fat tails in the small sample distribution. So, cointegration tests 
essentially test for the rank of $\alpha$ and can be considered as tests for 
the local identification of $\beta$ and are, therefore, comparable with the 
concentration parameter in the INSEM (see Phillips [1983]). 


The maximum likelihood cointegrating vector estimator is ap- 
pealing as it has a simple expression in the standard case. The 
relation between maximum likelihood cointegrating vector estima- 
tors and canonical correlations is, however, lost when extensions 
of the model are considered. The GMM framework allows for 
the analytical construction of cointegrating vector estimators for a 
general class of models. In the next sections two kinds of structural 
break model extensions are analyzed, i.e., structural breaks in the 
variance (heteroskedasticity) and structural breaks in the cointe- 
grating vector and/or multiplicator, for which the cointegrating 
vector maximum likelihood estimators are not of the canonical 
correlation type. 
7.3 Cointegration in a Model with Heteroskedasticity 


Assuming homoskedastic disturbances in (7-1), the maximum 
likelihood estimator of the cointegrating vector can be constructed 
using canonical correlations. This estimator has a normal 
limiting distribution under conditions which are more general than 
strict homoskedasticity (see, for example, Phillips and Solo [1992], 
where it is shown that the weak convergence is retained in the 
case of conditional heteroskedasticity with constant unconditional 
variances). This weak convergence is, however, lost when the mean 
of the conditional variances changes from period to period. 
Furthermore, the relation between the maximum likelihood estimator and 
canonical correlations is lost. A GMM-2SLS cointegrating vector 
estimator can still, however, be constructed when the functional 
form of the heteroskedasticity is known. To show this, we derive 
GMM estimators and their limiting distributions for an example of 
a change of the variance after a predefined period $T_1$. The analyzed 
model thus reads, 

$$\Delta x_t = \alpha\beta' x_{t-1} + \varepsilon_t, \tag{7-38}$$

where 

$$\mathrm{cov}(\varepsilon_t) = \begin{cases} \Sigma_1, & t = 1, \ldots, T_1, \\ \Sigma_2, & t = T_1 + 1, \ldots, T. \end{cases} \tag{7-39}$$

In the next section, the GMM cointegrating vector estimator and 
cointegration test and their limiting distributions are derived using 
a Generalized Least Squares (GLS) framework to account for the 
heteroskedasticity. 


7.3.1 Generalized Least Squares Cointegration Estimators 


Assuming that we know the form of the heteroskedasticity, we use 
a different GMM objective function than (7-5) to derive the GMM 
estimators. 

$$\begin{aligned} Q_T(\alpha, \beta) = \; & \mathrm{vec}\Big(\sum_{t=1}^{T_1} \Sigma_1^{-1} \varepsilon_t x_{t-1}' + \sum_{t=T_1+1}^{T} \Sigma_2^{-1} \varepsilon_t x_{t-1}'\Big)' \\ & \Big(\Big(\sum_{t=1}^{T_1} x_{t-1} x_{t-1}' \otimes \Sigma_1^{-1}\Big) + \Big(\sum_{t=T_1+1}^{T} x_{t-1} x_{t-1}' \otimes \Sigma_2^{-1}\Big)\Big)^{-1} \\ & \mathrm{vec}\Big(\sum_{t=1}^{T_1} \Sigma_1^{-1} \varepsilon_t x_{t-1}' + \sum_{t=T_1+1}^{T} \Sigma_2^{-1} \varepsilon_t x_{t-1}'\Big). \end{aligned} \tag{7-40}$$

In the next theorem the GMM estimators and their limiting 
distributions jointly with the limiting distribution of the optimal value of 
the GMM objective function are stated. Note that the estimators 
and limiting distributions of Theorem 7.4 can also be extended to 
more variance shifts, and other moment conditions for the variances 
can be incorporated as well. 


THEOREM 7.4 


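When the DGP in equations (7-38), (7-39) is such that 
the number of cointegrating vectors is $r$ ($k - r$ unit roots), 
$\alpha_\perp'\beta_\perp$ nonsingular, the GMM estimators, 

$$\mathrm{vec}(\hat{\alpha}) = \Big(\Big(\sum_{t=1}^{T_1} x_{t-1} x_{t-1}' \otimes \hat{\Sigma}_1^{-1}\Big) + \Big(\sum_{t=T_1+1}^{T} x_{t-1} x_{t-1}' \otimes \hat{\Sigma}_2^{-1}\Big)\Big)_1^{-1} \mathrm{vec}\Big(\sum_{t=1}^{T_1} \hat{\Sigma}_1^{-1} \Delta x_t x_{t-1}' + \sum_{t=T_1+1}^{T} \hat{\Sigma}_2^{-1} \Delta x_t x_{t-1}'\Big), \tag{7-41}$$

and 

$$\mathrm{vec}(\hat{\beta}') = \Big(\Big(\sum_{t=1}^{T_1} x_{t-1} x_{t-1}' \otimes \hat{\alpha}'\hat{\Sigma}_1^{-1}\hat{\alpha}\Big) + \Big(\sum_{t=T_1+1}^{T} x_{t-1} x_{t-1}' \otimes \hat{\alpha}'\hat{\Sigma}_2^{-1}\hat{\alpha}\Big)\Big)^{-1} \mathrm{vec}\Big(\hat{\alpha}'\hat{\Sigma}_1^{-1} \sum_{t=1}^{T_1} \Delta x_t x_{t-1}' + \hat{\alpha}'\hat{\Sigma}_2^{-1} \sum_{t=T_1+1}^{T} \Delta x_t x_{t-1}'\Big), \tag{7-42}$$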
have a limiting behavior characterized by 
VTvec(& — a) +N (0, (w(cov(B’x),; ® a’ Erta) (7-43) 
+ (1 — w)(cov(6'x)2 9 a’Xz"a))~™), 
and 
T[vec(B: — 9) 
= (ELpL) Ba. @ 1) (Ar | WWA, @ a'Er'a)+ 
0 
w (7-44) 


+ AyW,(w))'dt @ a’ Dy1a))—vee[2 ( J dW,W!)M 
0 


+.04( f diWa(t)(A2Wi(t) + ArWs(w))'de)], 


Assuming that the value of r used for the estimators @ and 
8 equals the value of r of the DGP (7-38), substitution of 
these estimators in (7-40) leads to a limiting behavior of 


the resulting optimal value of the GMM objective function 
characterized by 


On, a J dW,W)K 
+ Ao( 7 dW, (t)(AoW,(t) + A1 Wi (w))'dt)] 


(Ar f WWI @ a, Erto) 
0 
1 


+ (fam) + A, W,(w)) 


(A.W, (t) + A W (w))/dt ® a’ Ezta) 
veel, ( f dW,W,)M 
o (7-45) 
pA J dW, (t)(A2W,(t) + AW (w)}'dt)], 


where w = 4, W, and W3, are stochastically independent 
r, (k — r) dimensional Brownian motions with identity co- 
variance matrices, A, = (a’,Sy'a,)?, Ao = (œL Ezta), 
Qı = (a’D,a)?, Q = (a'a), 


= 1 Ti Tı z 
S = Tk 2 An — iÈ titia) tri 1)Azh, 
a 1 
2 T-T,-k 
fe T 
X Aetea (D> teata) ea) Ag, 
t=T,41 t=T, +1 


cov(6'z): = B' 2 C ECB, cov(b'£)a = P 2 C*DC7'6, 


and $(\;)_1^{-1}$ denotes the first $kr$ rows of $(\;)^{-1}$. 


PROOF OF THE THEOREM 


The asymptotic results for subsamples using the fraction 
$w$ result from Perron [1989]. Using these asymptotics for 
subsamples, the other results follow straightforwardly from 
the proofs of Theorems 7.1-7.3. 

∎ 


The cointegrating vector estimator from Theorem 7.4 is a 
GMM-2SLS estimator as it is constructed in two sequential steps. 
In the first step, we estimate $\Pi$ in (7-3) using least squares and use 
its first $r$ columns to construct $\hat{\alpha}$. Furthermore, we construct $\hat{\Sigma}_1$ 
and $\hat{\Sigma}_2$ as the sum of squared residuals of the two subsamples. In 
the second step we then construct the GMM estimator $\hat{\beta}$ (7-42). 


Theorem 7.4 shows the asymptotic normality of the GMM
cointegrating vector estimator $\hat\beta$. Jointly with Theorems 7.1-7.2,
this shows the elegance of the GMM as it leads to asymptotic
normality even in some models which violate the standard
assumptions. When we use a cointegrating vector estimator which neglects
the heteroskedasticity of the disturbances, we cannot find accurate
expressions of its covariance matrix, so that it is hard to test
hypotheses on the cointegrating vector. The resulting estimator is not
asymptotically normally distributed either.


The limiting distribution of the optimal value of the GMM
objective function depends on the relative change of the covariance
matrix and w, the relative time period during which the variance
differs. As it is not known what the true values of these parameters
are, they are typically replaced by sample estimates. The resulting
distribution is, in that case, no longer the true limiting distribution
but only an approximation of it. It would be interesting to investigate
whether nonparametric covariance estimators, like the White
(White [1980]) or Newey-West (Newey and West [1987]) estimators
(see also Chapter 3), can be used to overcome these difficulties. These
covariance matrix estimators can directly be used in the GMM objective
function but the expressions of the resulting limiting distributions
are still unknown.
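To make the reference to nonparametric covariance estimators concrete, the following is a minimal sketch of a Newey-West type estimator with Bartlett weights. It is not taken from this chapter: the function name `newey_west`, the centering of the moment contributions, and the normalization by the full sample size are our own assumptions. With zero lags it reduces to a White type estimator.

```python
import numpy as np

def newey_west(g, lags):
    """Bartlett-kernel (Newey-West type) long-run covariance estimate.

    g    : (T, m) array of moment contributions (hypothetical input).
    lags : number of autocovariances to include; lags=0 gives a
           White-type estimator.
    Centering g and dividing by T are assumptions of this sketch.
    """
    g = g - g.mean(axis=0)
    T = g.shape[0]
    S = g.T @ g / T                          # Gamma_0
    for j in range(1, lags + 1):
        weight = 1.0 - j / (lags + 1.0)      # Bartlett weight
        gamma_j = g[j:].T @ g[:-j] / T       # j-th autocovariance
        S += weight * (gamma_j + gamma_j.T)
    return S

rng = np.random.default_rng(0)
g_demo = rng.normal(size=(200, 3))
S0 = newey_west(g_demo, 0)
S4 = newey_west(g_demo, 4)
```

Dividing every autocovariance by T (rather than T - j) keeps the estimate positive semidefinite, which matters when it is inverted inside a GMM objective function.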


7.4 Cointegration with Structural Breaks 


In ECCMs with structural breaks in the value of the multiplicator
or the cointegrating vector, the maximum likelihood estimators no
longer have a closed analytical form. The GMM, however, still
gives analytical expressions of cointegrating vector estimators even
in these kinds of models. To show this, we analyze the influence of
a structural break in $\alpha$, the value of the multiplicator, and $\beta$, the
cointegrating vector, at time point $T_1$. The model, therefore, reads

$$\Delta x_t = \alpha\beta' x_{t-1} + \varepsilon_t, \qquad t = 1, \ldots, T_1,$$
$$\Delta x_t = \theta\gamma' x_{t-1} + \varepsilon_t, \qquad t = T_1 + 1, \ldots, T,$$
(7-46)
where $\varepsilon_t$, $t = 1, \ldots, T$, are Gaussian white noise disturbances with
covariance matrix $\Sigma$. The GMM objective function we use for this
model reads,


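$$Q_T(\alpha, \beta, \gamma, \theta) = \mathrm{vec}\Big(\sum_{t=1}^{T_1} \varepsilon_t x_{t-1}',\ \sum_{t=T_1+1}^{T} \varepsilon_t x_{t-1}'\Big)'
\begin{pmatrix}
\big(\sum_{t=1}^{T_1} x_{t-1}x_{t-1}'\big)^{-1} \otimes \Sigma^{-1} & 0 \\
0 & \big(\sum_{t=T_1+1}^{T} x_{t-1}x_{t-1}'\big)^{-1} \otimes \Sigma^{-1}
\end{pmatrix}
\mathrm{vec}\Big(\sum_{t=1}^{T_1} \varepsilon_t x_{t-1}',\ \sum_{t=T_1+1}^{T} \varepsilon_t x_{t-1}'\Big),$$
(7-47)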
where vec(A, B) = (vec(A)' vec(B)')'. In Theorem 7.5, the GMM
estimators of the cointegrating vector, multiplicator and their
limiting distributions are presented jointly with the limiting
distribution of the GMM objective function. As the cointegrating vector
estimators and multiplicators all have normal limiting distributions,
standard $\chi^2$ tests can be performed to test for the equality of the
parameters in each of the two periods. Theorem 7.5 also presents
the estimators and their limiting distributions, which can be used
when either the cointegrating vectors or multiplicators in each of
the two periods are equal to each other.
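The two-regime DGP (7-46) can be simulated directly, which is useful for checking estimators numerically. The sketch below is only illustrative: the function name `simulate_break_eccm`, the zero starting value and the parameter values are our own assumptions and do not come from the text.

```python
import numpy as np

def simulate_break_eccm(T, T1, alpha, beta, theta, gamma, Sigma, rng):
    """Simulate Delta x_t = Pi_t x_{t-1} + eps_t with one break at T1.

    Pi_t = alpha beta' for t <= T1 and theta gamma' for t > T1.
    Returns x (T+1 rows, including the assumed start x_0 = 0) and eps.
    """
    k = alpha.shape[0]
    chol = np.linalg.cholesky(Sigma)
    eps = rng.normal(size=(T, k)) @ chol.T   # Gaussian, covariance Sigma
    Pi1 = alpha @ beta.T
    Pi2 = theta @ gamma.T
    x = np.zeros((T + 1, k))
    for t in range(1, T + 1):
        Pi = Pi1 if t <= T1 else Pi2
        x[t] = x[t - 1] + Pi @ x[t - 1] + eps[t - 1]
    return x, eps

# Hypothetical parameter values with k = 2 and r = 1.
alpha = np.array([[-0.2], [0.1]])
beta = np.array([[1.0], [-1.0]])
theta = np.array([[-0.4], [0.0]])
gamma = np.array([[1.0], [-0.5]])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
T, T1 = 200, 120
x, eps = simulate_break_eccm(T, T1, alpha, beta, theta, gamma, Sigma,
                             np.random.default_rng(0))
```

The break fraction in the theorem then corresponds to T1 / T in this simulation.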


THEOREM 7.5 


When the DGP in (7-46) is such that the number of cointegrating
vectors is r (k - r unit roots), $\alpha_\perp'\beta_\perp$ nonsingular,
the estimators,


Tı Tı 
a= o> Az,(1 — a Ale 2410541) L221) 214-1) 


t=1 t=1 


s be (7-48) 
(5 £ı-1(1 — Aaa Lat-1L3-1) Ea-1)£1t-1) 
t=1 t=1 

ĝ = 
T T 

( ye Az, (1 — x54_1( 5 E412 9¢-1) Ta-1)Tit-1) 

t=T,4+1 t=T,+1 (7-49) 
T T 


( 5 £it-1(1 — £a—1( D Eot-1L2-1) Wor) Pig) 


t=T,4+1 t=T)+1 
and


Tı Tı 
B= (X tiaia) (S r ALE aaa), (7-50) 
t=1 t=1 


a = 

T T 

1 pa , -12 ðn- A- 7—51 
CE waa E manare O 
t=T,+1 t=7T,+1 
have a limiting behavior characterized by 
VT(G—a) > N(0, cov(8'z)! @ wd), (7-52) 

VT(6-0) > N(0, cov(7z)! @ (1—w)3), (7-53) 

and 
T8- 6) => 


ae w (7-54) 
(aesan d mW) wawpo ) ' 
T(Y2 — 72) > 
CARCAN VAENE O) 
(VBL Ba MW (w) + 72 (84.72) AWD] 
IAA (a BL) AW (w) + yiya (0y) AWE) 
dW,(t)'d}%, 
(7-55) 


The limiting behavior of the optimal value of the objective 
function can be characterized by, 


Qr(G, 8,7, 4) 


zee J W,aw?y’)'(( / WiW!)-! @ Ip_n)vec(( f W,dw)’) 
1 
+ weet f 1A3%0' y (171) 72, Bx (a4, Bx) Wi (w) 


1 
+ W,(t)|dW, OVU AZ 8 ra 14) 29, ap) 
W,(w) + W (OAZ O yy 71) Y 61(0 BL) AW (w) 
1 
+ W(t) dt] @ In- )vec( fA O IDB) 


(7-56) 
When the model in (7-46) is such that the cointegrating
vectors are equal in the two subsamples, $\beta = \gamma$, which can
be tested using an asymptotic $\chi^2$ test, the GMM estimator
for $\beta$ reads (estimators for $\alpha$ and $\theta$ result from (7-48) and
(7-49)),


Ti 
vec(B’) = (X z:-17;-1 9 &@ET'8))+ 


t=1 


T 
(X tat OPE A U @ @ D7") 


t=T +1 


(vec($ ` Azzi) + (ik 8 6'D51) (vec( 5 Az;2'_;))] 


t=1 t=T,+1 


1 Tı 
=vec(( 7 ) y+ (o> p40, 1 Q @' Er â)) 
t=1 


P 
=M2 


(7-57) 


T Tı 
+( >> teati 80E O vec aE er) 
t=Tı +1 t=1 
T A, 
+vec( XO @'Xz*ec/_1)] 


t=T)4+1 
and the limiting behavior of this estimator is characterized
by


Tvec(ĝz — Be) 
=> ((681)* 8 IVb) A 


(| WWA (B01) @ aE“) 
0 


+( J (061) M Wa (w) + (881) AW (6) 
(0B MW (w) + (8, B1) NeW, (¢))'dt] 
@ 0D)“ vee(M ( j dWWi)A! (La)! 


1 


+ Oa f[aWe(t)(Wi(w)A4 (8) aL) + WIA, 01). 
(7-58) 

When the model in (7-46) is such that the multiplicators of
the cointegrating vectors are equal in the two subsamples,
$\alpha = \theta$, which can be tested using a $\chi^2$ test, the GMM
estimator for $\alpha$ reads (estimators for $\beta$ and $\gamma$ result from
(7-50) and (7-51)),


a= (Anat, it 3 Axx, 19) x 


seas (7-59) 
(3 Sa- 124- B+? 5 Li-18419) i 
t=Tı +1 


and its limiting behavior is characterized by 

VT(@ — a) > N(0, wcov(8'x)ı + (1 — w)cov(y'z)2) (7—60) 
where w = a , W, and Wo, are stochastically independent 

r, (k-r) dimensional Brownian motions with identity co- 
variance matrices, A, = (a’,D~!a,)?, Az = (0E7101), 

N: = (a’Za)?, Q = (6'E8)?, 


1 
ag an 1-274 Sai: ie 4) ees, 


t=1 
5, = T_T _Ę > Ax, 


t= efi 


(1— zi (> Ti-124 1) @e-1) Ari, 
t=7,4+1 
cov(B'x), = B 3 uCp, cov(y'r)2 = p’ 2 Cz X203p, 


and $C_1(L)$, $C_2(L)$ are the Vector Moving Average representations
of the first and second subsamples.


B 
PROOF OF THE THEOREM 
Again uses asymptotics for subsamples (see Perron [1989]) 
and results from proofs of Theorems 7.1-7.3. 
∎


Theorem 7.5 shows that the GMM estimators of the cointegrating
vector and multiplicator have normal limiting distributions
in case of structural breaks in the cointegrating vector and/or
multiplicator. Similar to the limiting distribution of the optimal
value of the GMM objective function in case of heteroskedasticity,
the limiting distribution of the optimal value of the GMM objective
function again depends on model parameters and the relative size of
the subsamples. An approximation of this limiting distribution can
again be constructed using the estimated values of the parameters,
$\alpha$, $\beta$, $\theta$, $\gamma$ and $T_1$. As this leads to a rather complicated testing
procedure, it may be preferable to fix the number of cointegrating
vectors a priori and just perform tests on the estimated cointegrating
vectors and multiplicators, which are straightforward to construct.
This also holds for the cointegration tests discussed in the previous
section.


7.5 Conclusion 


The GMM preserves many of its properties when used to construct
estimators in cointegration models. Although cointegration models
violate the standard assumptions used to derive the asymptotic
normality of the GMM estimators, the estimators still retain this property.
Tests on the parameters in cointegration models can therefore be
conducted straightforwardly using asymptotic $\chi^2$ critical values.
Tests on the cointegration rank can be performed using the GMM
objective function as it has a limiting distribution which is identical
to the limiting distribution of the Johansen trace statistic. Both
these asymptotic properties and the easy analytical expressions
of the GMM estimators show the attractiveness of the GMM in
cointegration models. These features also hold for the maximum
likelihood estimators of the standard cointegration model, but these
estimators lose much of their attractiveness when more complex
models are considered. We showed, however, that the GMM still
gives tractable analytical expressions for the estimators and allows
for the derivation of the asymptotic properties even in more
complicated cointegration models. In applied work, many cointegrated
series, like most financial time series, violate the properties of
the standard cointegration model, and the development of estimation
techniques able to deal with them is quite important.


Appendix 


PROOF OF THEOREM 7.1


In Johansen [1991], it is proved that the stochastic process 
x, from (7-1), can be represented by Ax, = C(L)D?&, 
where 8 = (I, —())', and & is a k—variate Gaussian 
white noise process with zero mean and identity covariance 
matrix. Consequently, 


t 
a, = Bi (a pL) AE? Se; + OLEG, 


j=l 
i I 1 
Bu = BLB ALED E + (FEN, 
j=l 


t 
1 —1 7 93 0 * 1 
Ta = (a Bi) 0E? > xs i G je (L)E? r, 
j=l g 


where

$$C(L) = C(1) + (1 - L)C^*(L), \qquad C^*(L) = \sum_{i=0}^{\infty} C_i^* L^i.$$

The least squares estimator of $\alpha$, $\hat\alpha$, can
also be expressed as
T T 
a-—a sO w(i — ASD Loe—1% 41) *@ae—1) 2144-1) 
t=1 t=1 
T T 
(S trea (1 = 254 (> Poe) Woe 1) 241) 
t=1 t=1 
T ~ 
=O, Ut (1r—-1 — Bo%ot-1)’) 


olen 1 — Bo2-1)(Gie—-1 — Byt2e-1)')* 


where By = (Pla For—-10,_1) 7124-1241. Bo is a super- 
consistent estimator of 8 and can therefore be treated as 
equal to 6z in the derivation of the limiting distribution of 
@. Since 


1 2 a at 1 
T > (£it-1 — BaZat-1)(£i-1 — ba£t—1) > 
t= 


cov(8'z) = ay: CECB, 


i=0 
and, 


T 
TO Ut(Lie-1 — 6,22-1)’) => N(0, cov(8'z) 8 E), 


t=1 


the limiting distribution of @ becomes 
VT(@— a) > N(0, cov('xz)“! 9 X). 


With respect to the cointegrating vector, 


T T 
p= tta) O_ naaa Raa = 
t=1 t=1 
I, 
T 1 —1 = 1 1 1 -15/35-1211 
(E Te-121)2 (E t-1(2;-1pa +u) Eaa Ea) 


I, 
& 1-10,-4)3'(¥ e-a(@t_sBa! +) Eaa = 
( a ) (2 ae 
Be (È tertile (D z1) DaDa) ) > 


where ( )z* indicates the last (k—r) rows of ()-! and â is a 
consistent estimator of a such that the difference between 
@ and a will only affects orders of convergence exceed- 


T 
ing T. Furthermore @ = (Y, Teata) (X L4-1A2}), 


where ();' indicates the first r rows of ()~!. To ana- 
lyze the limiting behavior of B, we have to determine 
the limiting expressions of both ($2; 2+-12/_,)z' and 
(LL riul) E talata). Starting with the latter ex- 
pression, its limiting behavior can be analyzed using the 
stochastic trend specification of £+—1. 


T 
DD 24-1U,) ala Da) = 


t=1 


a t-1 
(D7 Bx (a1, Ai) ta DHS" )gE-ba 
t=1 jal 
T 
+ CL) EEE a) Da). 


Since $\Sigma^{1/2}\alpha_\perp$ is orthogonal to $\Sigma^{-1/2}\alpha$, i.e.,
$(\Sigma^{1/2}\alpha_\perp)'\Sigma^{-1/2}\alpha = \alpha_\perp'\alpha = 0$,
the Brownian motions appearing in the limiting
expression are independent,

1 T ryt — tyr 4 TAL 

+ ae! ‘3 ENED Ia > Ay J W, dWiA,, 

= j= 

since RL VDI &;) > AW, W, is a (k — r) dimen- 
sional Brownian motion with covariance matrix J,_, and 
Ay = (a La 1)3, W, is a r dimensional Brownian motion 
with covariance matrix I, and W, is stochastically indepen- 
dent of W,, Az = (a’D-1q)?. Also the limiting behavior of 
(Eli £121)! is determined by the stochastic trend 
specification


T 
(X teir) = 
t=1 


(8 BAIE BLY Ye 1&1) (8 BL É 
CENE 


So, the limiting behavior of 


(8 Bı y (X t121) (8 Bi) is, 


(P28 TB O data) T TBS 
t=1 
(ee ae 0 ) 
0 B Bi (a', BL) Aa (f WWA (œ bL) "BBL 


as 
T 
TY BP'O*(L)Et E182 C*(L)B = cov(6'z), 
t=1 
T A t—1 
T? bilab) Y (aE AE) 
t=1 j=1 
t—1 
(D2 EEan) bL) "8,6. > 
j=l 
BBL BLA f WWA 81)" Ba, 
T t-1 
T Y p'O (LE? Ea G D201) (aba) "BB 
t=1 j=l 


=> 0. 
Consequently, 


T 
((T-28 TB) ey Hit) (T728 T6] > 


a a 
where
© = (8,81) Laas" | WWA o pae 


and 
T 
(D> 24-12;_,)7' > O(T) Bcov(G'x)* 8’ + OT?) 
t=1 


MADMAN I LANE ANADA 


where O(T) indicates that the limiting behavior of this 
part is proportional to TÍ. The latter part governs the 


T 
limiting behavior of (X` £e-1%4_1)2', which can be char- 
t=1 


acterized by 


T 
TPO Tr-124_1)2 = 


t=1 


(BBL) Basar" f WWA A, BBB A 


1 
as 68, = ( : ‘ ) . So, the limiting expression for the coin- 


tegrating vector estimator becomes, 

z 0 

T(8— = , = z A= 1 
B-0=( pigan g WWS wawa) 
0 
“a an aE- ta D 5) 
and can be approximated by 
12 T 
T2 Do taa aal tuatua) Teita) 
t=1 t=1 : 


(= 0) 
PROOF OF THEOREM 7.2


(Only the second part of Theorem 7.2 is proved here.)
When the DGP of $x_t$ reads,
Az, = ar + a(B’ay-1 +p’) + E, 


where c = a,’ + ap’, it has the stochastic trend rep- 
resentation, see Johansen [1991], Az, = C(L)(c + ©2&), 
B=(1, 5)’, where & is a k—variate Gaussian white 
noise process with zero mean and identity covariance ma- 
trix. Consequently, 


t 
a, = B (œ B1) (ta X +E 6) 


j=1 


+ C*(1)ap' + C*(L)Dt&, 


t 
Tie = Bala BL) O (ta + 3 YS E) 


j=l 
(F) (Man! + Et), 
Lo = (œa bL) ta (tar + m2 DE) 
j=1 


+ (42) (C Dar CEt), 


where 
C(L) = C(1) + (1— L)C* (L), 
C*(L) = 5 CL, B'O a= L. 
The least squares a of a, @, can also be expressed 
as 
a@-a= 


Eme-(HVEC ICE 


t=1 


a N 
Eee (MEH 


t=1 


(Pi pain 
Kje ED 
aÇ -EEEE 
Bo is a E TDN estimator of b2. Since 
Hien (8) Erpe GE) 


= cov(p'z — p’') = p 2 CIEC? P, 


i=0 


T?( Fulan- 17 a Ce yy) 
=> ‘NO, cov(S'r — u’) E), 
the limiting distribution of @ becomes 
VT(@— a) > N(0, cov(b'z — w)! O X). 


With respect to the cointegrating vector, 


and, 


EEE saree 


t= t=1 
P-ENN EC) 
(eo i } c) a’ + ro’, +u 


ECT) Cr pegs) 


1 
(Gy 2 & a’ +w) Eala Ea) 
B T Z A 1 
N2 t-1 t-1 —1 
a ECTE 
( H ) > 1 1 poe 
I /x 
© ( a w)E alad Eta) 
t=1 
as a’, Eta = 0 since E! = PAP’, A =diag(X;) = 
TE Aee, PP’ = I, a’, PP’a = b,b = 0, Pa = b = 
i=1 i L L 
(b1..-b4)', aL ETa = bi Ab = yo, Abbi = 0 as b'i ibi = 0 
vi. (© ca r) te )z' indicates the last (k —r +1) 
Lt-1 Lr-1 


rows of (4 ( 1 1 


estimator of a such that the difference between @ and 
a will only affect orders of convergence exceeding T. To 
analyze the limiting behavior of 8, we have to determine 


1 
the limiting expressions of both ($2 (Ge ) ‘es ) ja 


1 
) )-1 and @ is a consistent 


1 1 
and (SŁ, a ”) ul) tala Eta) t. Starting with the 


latter expression, its limiting behavior can be analyzed 
using the stochastic trend specification. 


T 
(0 (71) alosa) 
=( j eom ERIO “i EDT? 


: 2 Q 
1 


> al Haw! + CEE) a OE 
Since $\Sigma^{1/2}\alpha_\perp$ is orthogonal to $\Sigma^{-1/2}\alpha$, i.e.,
$(\Sigma^{1/2}\alpha_\perp)'\Sigma^{-1/2}\alpha = \alpha_\perp'\alpha = 0$,
the Brownian motions appearing in the limiting
expression are independent,


TAL 1 -1 a/ E 1 4 = isk 
(“pay ) (any tat, lta’ +I gE Fa) 


1 j= 
As J Z) JWA, 


and 


PAT sas 0 Voss T 
( 0 ; ry YO Bi (0,81) a" (tar 
t=1 


t-1 


1 tot W. TAT 
HEED Ha) > Ae | (%) awans, 


So, 
ft Irri 0 ) ' r 
E 0 T-3 A ( Ge i 
0 T3 t=1 
Wi 
ald Ea) > e YI T ) aw, 
L 


7 — 1 i — À + 
where 64 = 6i (B1B) B a laa) ( 5) , ÀX = 
0, Wi is a (k — r — 1) dimensional Brownian motion with 
covariance matrix Iķ-r-ı and 


he = (ALa) aL BF) (Ar(aiau)ta BY, 
AA AN ? 
w is a r dimensional Brownian motion with covariance 


matrix J, and W3 is stochastically independent of Wy, 
A = (a’D"1a)3, T(t) =t, (t) = 1, 0 < t < 1. Also the 


T / 
limiting behavior of (> Ge : ) ca : ) )~1 is determined 
t=1 
by the stochastic trend specification:
T: 1 
Lt-1 Lt-1 =i 
aeea, 
-(8 Bi (8 Bi 0 n> ti-i) (1)') 
0 0 1 0 0 1/ `&\ 1 1 


ea oe a 


—ł4 d ue EER 0 i 
Te pi ( *y) 0 ze- 
0 T a (> ( : ) 
0 0 T3 t=1 
1 pe * Tt =p= 0 
(F) ) (" 22 Bi ( ' : rs) 0 
: 0 0 T- 
cov(G'x — p’) + pH 0 yw 
> 0 a(S (2) (M) yas 0 
H 0 
Consequently, 

1 T-I, =j 0 $ T 
gs E 
eau 

0 0 [3 t=1 
, 1 TI, Spe 0 x 
o A aa E s 
5 0 0 T- 
cov(@'x — p’)7} 0 
—17/ Wi Wi f —14Ą-1 
0 ata (en ag bas 
—peov(S'x — w’) 0 


( —cov(b'x — p) p )) 
0 
1+ pcov(B'a — w) p 
T 
The limiting behavior of (>> 2:_12/_,)7' now becomes 
t=1 


(AC) EC) rye 


T3 
* TIk-r-1 0 0 
0 T? 
si} W: Wi’ 24 
A3 0 1 11 11 zi As 0 
=(% i) qi ( 7 T a 


ra ya frALy 
where 63, = (8,81) pLa (aa) ( a . So, the 


limiting expression for the cointegrating vector estimator 
becomes, 


(ae. ph ) (28,85) 263 Dono 


0 T? H-H 


aa S] ("yop (F) 


>N (0, ada Q O2) 5 


where 


a- (3 DUTT e ay" 


PROOF OF THEOREM 7.3
(Only the first part is proved here; the other proofs are
similar.) The optimal value of the GMM objective function
reads


T 
Qr(&,B) = vec( et "de 1211) 1 8 X7!) 


TO 21) 
t=1 
T T 
=vec(S (Aa, — a(a’x*a)ta’=* (5° Azizi) 


t=1 t=1 


T T T 
2 A 2 ity) O tama Sa) 
= = t=1 
T 
weet (Ax, — A(&@ DA) REY Ars, ) 


t=1 
oy Pest)" D 4124-1) 
t=1 t=1 


=vec((D7! — £714 (a’=- 1@)~ 1'E PO Ana DY 


T 
(X teir) 8 D)vec((Z7 — Eaa Eta) 


t=1 
T 
VENY Arzi). 
t=1 
T 
This functional consists of two parts, (( 2 48, 4) 89X) 
and vec((Ł -1-57 tR(& La) RE NE Az,2}_,)), each 
t= 
of which limiting behavior is analyzed Separately: Starting 


with the latter expression, 


A vec((E"? -yaan ayes" Nana, 1)) 
=p vee((@,(@', 581) ad Ax,x;_1)) 


t-1 
>= Zvec(as (al, Bau)~ Sa DES Eta (LaL) )) 
j=1 


sawn Saia Da) el J W, dW) A). 
While 


T? (X te-12;-1) 8 2) > 


(61 (0,81) BELa LAT” f WWA o, BL (LBL) 1B, 8 F). 
So,
Qr (a, B) 
sveo(Aa(f WidW4)'A4)'(Bx (0, B1)"* @ as (ol, 2a) 


(8. (8,81) BLL f WWA a Bs (81,81) E) 
(x(a 81) Ba (a, Ba) *)vee( Aa | WidwWiy Ai) 
Svec(A( | WidWi Y AAT” f WWA? @ (al, Zany) 
PIA, J Wi dW! YA,) 
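$$= \mathrm{vec}\Big(\big(\textstyle\int W_1 dW_1'\big)'\Big)' \Big(\big(\textstyle\int W_1 W_1'\big)^{-1} \otimes I_{k-r}\Big)\, \mathrm{vec}\Big(\big(\textstyle\int W_1 dW_1'\big)'\Big)$$

$$= \mathrm{tr}\Big(\big(\textstyle\int W_1 dW_1'\big)' \big(\textstyle\int W_1 W_1'\big)^{-1} \big(\textstyle\int W_1 dW_1'\big)\Big).$$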
Chapter 8


ESTIMATION OF LINEAR PANEL 


DATA MODELS USING GMM 


Seung C. Ahn and Peter Schmidt 


Use of panel data regression methods has become increasingly 
popular as the availability of longitudinal data sets has grown. 
Panel data contain repeated time series observations (T) for a large 
number (N) of cross sectional units (e.g., individuals, households, 
or firms). An important advantage of using such data is that they 
allow researchers to control for unobservable heterogeneity, that 
is, systematic differences across cross sectional units. Regressions 
using aggregated time series and pure cross section data are likely to 
be contaminated by these effects, and statistical inferences obtained 
by ignoring these effects could be seriously biased. When panel data 
are available, error components models can be used to control for 
these individual differences. Such a model typically assumes that 
the stochastic error term has two components: a time invariant 
individual effect which captures the unobservable individual het- 
erogeneity and the usual random noise term. Some explanatory 
variables (e.g., years of schooling in the earnings equation) are 
likely to be correlated with the individual effects (e.g., unobservable 
talent or IQ). A simple treatment of this problem is the within
estimator, which is equivalent to least squares after transformation
of the data to deviations from means.


Unfortunately, the within method has two serious defects. 
First, the within transformation of a model wipes out time invariant 
regressors as well as the individual effect, so that it is not possible 
to estimate the effects of time invariant regressors on the dependent 
variable. Second, consistency of the within estimator requires that
all the regressors in a given model be strictly exogenous with respect 
to the random noise. The within estimator could be inconsistent 
for models in which regressors are only weakly exogenous, such as 
dynamic models including lagged dependent variables as regressors. 
In response to these problems, a number of studies have developed 
alternative GMM estimation methods. 


In this chapter we provide a systematic account of GMM es- 
timation of linear panel data models. Several different types of 
models are considered, including the linear regression model with 
strictly or weakly exogenous regressors, the simultaneous regression 
model, and a dynamic linear model containing a lagged dependent 
variable as a regressor. In each case, different assumptions about 
the exogeneity of the explanatory variables generate different sets 
of moment conditions that can be used in estimation. This chapter 
lists the relevant sets of moment conditions and gives some results 
on simple ways in which they can be imposed. In particular, 
attention is paid to the question of under what circumstances the 
efficient GMM estimator takes the form of an instrumental variables 
estimator.¹


8.1 Preliminaries 


In this section we introduce the general model of interest, some 
basic assumptions, and some notation. Given linear moment con- 
ditions, we consider the efficient GMM estimator and other related 
instrumental variables estimators. We also examine general condi- 
tions under which the efficient GMM estimator is of an instrumental 
variables form. 


¹ Here and throughout this chapter, “efficient” means “asymptotically
efficient”.
8.1.1 General Model


The models we examine in this chapter are of the following common
form:

$$y_{it} = x_{it}\beta + u_{it}; \qquad u_{it} = \alpha_i + \varepsilon_{it}.$$ (8-1)

Here $i = 1, \ldots, N$ indexes the cross sectional unit (individual) and
$t = 1, \ldots, T$ indexes time. The dependent variable is $y_{it}$, while $x_{it}$ is
a $1 \times p$ vector of explanatory variables. The $p \times 1$ parameter vector
$\beta$ is unknown. The composite error $u_{it}$ contains a time invariant
individual effect $\alpha_i$ and random noise $\varepsilon_{it}$. We assume that $E(\alpha_i) = 0$
and $E(\varepsilon_{it}) = 0$ for any i and t; thus, $E(u_{it}) = 0$. The time
dimension T is held fixed, so that usual asymptotics apply as N gets
large. We also assume that the data $\{(y_{i1}, \ldots, y_{iT}, x_{i1}, \ldots, x_{iT})' \mid
i = 1, \ldots, N\}$ are independently and identically distributed (i.i.d.)
over different i and have finite population moments up to fourth
order. Under this assumption, any sample moment (up to fourth
order) of the data converges to its counterpart population moment
in probability, e.g., $\mathrm{plim}_{N \to \infty} N^{-1}\sum_{i=1}^{N} x_{it}'y_{it} = E(x_{it}'y_{it})$.


Some matrix notation is useful throughout this chapter. For
any single variable $c_{it}$ or row vector $d_{it}$, we denote $c_i = (c_{i1}, \ldots, c_{iT})'$
and $D_i = (d_{i1}', \ldots, d_{iT}')'$. Accordingly, $y_i$ and $X_i$ denote the data
matrices of T rows. In addition, for any $T \times 1$ vector $c_i$ or $T \times k$
matrix $D_i$, we denote $c = (c_1', \ldots, c_N')'$ and $D = (D_1', \ldots, D_N')'$.
Accordingly, y and X denote the data matrices of NT rows. With
this notation, we can rewrite equation (8-1) for individual i as

$$y_i = X_i\beta + u_i; \qquad u_i = (e_T \otimes \alpha_i) + \varepsilon_i,$$ (8-2)

where $e_T$ is a $T \times 1$ vector of ones, and all NT observations as
$y = X\beta + u$.

We treat the individual effects $\alpha_i$ as random, so that we can
define $\Sigma \equiv \mathrm{cov}(u_i) = E(u_iu_i')$. A standard and popular assumption
about $\Sigma$ is the so-called random effects structure

$$\Sigma = \sigma_\alpha^2 e_Te_T' + \sigma_\varepsilon^2 I_T,$$ (8-3)

which arises if the $\varepsilon_{it}$ are i.i.d. over time and independent of $\alpha_i$.
This covariance structure often plays an important role in GMM
estimation as we discuss below.


There are two well-known special cases of the model (8-2): the
traditional random effects and fixed effects models. Both of these
models assume that the regressors $X_i$ are strictly exogenous with
respect to the random noise $\varepsilon_i$ (i.e., $E(x_{it}'\varepsilon_{is}) = 0$ for any t and s).
The random effects model (Balestra and Nerlove [1966]) treats the
individual effects as random unobservables which are uncorrelated
with all of the regressors. Under this assumption, the parameter
$\beta$ can be consistently and efficiently estimated by generalized least
squares (GLS): $\hat\beta_{GLS} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$, where $\Omega = I_N \otimes \Sigma$.
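As an illustration of the random effects structure (8-3) and the GLS estimator, the following sketch builds the T x T covariance matrix and exploits the block-diagonal form of the NT x NT covariance matrix, so that only sums over individuals are needed. The function names, the array layout (N, T, p) and the simulated data are our own assumptions, and the variance components are treated as known.

```python
import numpy as np

def re_covariance(T, sigma2_alpha, sigma2_eps):
    """Random effects covariance: sigma_alpha^2 e e' + sigma_eps^2 I_T."""
    e_T = np.ones((T, 1))
    return sigma2_alpha * (e_T @ e_T.T) + sigma2_eps * np.eye(T)

def gls_estimator(y, X, Sigma):
    """GLS with Omega = I_N kron Sigma; y is (N, T), X is (N, T, p)."""
    Sigma_inv = np.linalg.inv(Sigma)
    # X' Omega^{-1} X = sum_i X_i' Sigma^{-1} X_i, and similarly for y.
    XSX = np.einsum('itp,ts,isq->pq', X, Sigma_inv, X)
    XSy = np.einsum('itp,ts,is->p', X, Sigma_inv, y)
    return np.linalg.solve(XSX, XSy)

# Simulated example (hypothetical values): effects uncorrelated with X.
rng = np.random.default_rng(0)
N, T, p = 400, 4, 2
beta = np.array([1.0, -0.5])
X = rng.normal(size=(N, T, p))
alpha = rng.normal(size=(N, 1))                  # variance 1
eps = 0.5 * rng.normal(size=(N, T))              # variance 0.25
y = X @ beta + alpha + eps
Sigma = re_covariance(T, 1.0, 0.25)
beta_gls = gls_estimator(y, X, Sigma)
```

In practice the variance components would be replaced by consistent estimates, which this sketch does not cover.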


In contrast, when we treat the $\alpha_i$ as nuisance parameters, the
model (8-2) reduces to the traditional fixed effects model. A simple
treatment of the fixed effects model is to remove the effects by
the (within) transformation of the model (8-2) to deviations from
individual means:

$$Q_Ty_i = Q_TX_i\beta + Q_Tu_i = Q_TX_i\beta + Q_T\varepsilon_i,$$ (8-4)

where $Q_T = I_T - P_T$, $P_T = T^{-1}e_Te_T'$, and the last equality results
from $Q_Te_T = 0$. Least squares on (8-4) yields the familiar within
estimator: $\hat\beta_W = (X'Q_VX)^{-1}X'Q_Vy$, where $Q_V = I_N \otimes Q_T$.
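The within transformation can be checked numerically. The sketch below forms the deviations-from-means projector, applies it individual by individual, and runs least squares on the transformed data. The function names, the (N, T, p) array layout and the simulated design (regressors correlated with the individual effect) are assumptions made for illustration.

```python
import numpy as np

def within_projector(T):
    """Q_T = I_T - P_T with P_T = e_T e_T' / T."""
    e_T = np.ones((T, 1))
    return np.eye(T) - e_T @ e_T.T / T

def within_estimator(y, X):
    """Least squares on deviations from individual means.

    y is (N, T) and X is (N, T, p); returns the p-vector estimate.
    """
    Q_T = within_projector(X.shape[1])
    Xd = Q_T @ X            # applies Q_T to each X_i
    yd = y @ Q_T            # Q_T is symmetric, so each row is (Q_T y_i)'
    XtX = np.einsum('itp,itq->pq', Xd, Xd)
    Xty = np.einsum('itp,it->p', Xd, yd)
    return np.linalg.solve(XtX, Xty)

# Simulated example (hypothetical values): X is correlated with alpha_i.
rng = np.random.default_rng(0)
N, T, p = 500, 4, 2
beta = np.array([1.0, -0.5])
alpha = rng.normal(size=(N, 1))
X = rng.normal(size=(N, T, p)) + alpha[:, :, None]
y = X @ beta + alpha + 0.1 * rng.normal(size=(N, T))
Q_T = within_projector(T)
beta_w = within_estimator(y, X)
```

Because the projector annihilates the vector of ones, the individual effect drops out even though it is correlated with the regressors in this design.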


Although the fixed effects model views the effects $\alpha_i$ as nuisance
parameters rather than random variables, the fixed effects treatment
(within estimation) is not inconsistent with the random effects
assumption. Mundlak [1978] considers an alternative random
effects model in which the effects $\alpha_i$ are allowed to be correlated
with all of the regressors $x_{i1}, \ldots, x_{iT}$. For this model, Mundlak
shows that the within estimator is an efficient GLS estimator. This
finding implies that the core difference between the random and
fixed effects models is not whether the effects are literally random
or nuisance parameters, but whether the effects are correlated or
uncorrelated with the regressors.


8.1.2 GMM and Instrumental Variables 


In this subsection we examine GMM and other related instrumental
variables estimators for the model (8-2). Our main focus is a
general treatment of given moment conditions, so we do not make
any specific exogeneity assumption regarding the regressors $X_i$. We
simply begin by assuming that there exists a set of $T \times k$ instruments
$Z_i$ which satisfies the moment condition

$$E(Z_i'u_i) = 0$$ (8-5)

and the usual identification condition, $\mathrm{rank}[E(Z_i'X_i)] = p$.
Under (8-5) and other usual regularity conditions, a consistent
and efficient estimate of $\beta$ can be obtained by minimizing the GMM
criterion function $N^{-1}(y - X\beta)'Z(V_N)^{-1}Z'(y - X\beta)$, where $V_N$ is any
consistent estimate of $V \equiv E(Z_i'u_iu_i'Z_i)$. A simple choice of $V_N$ is

$$N^{-1}\sum_{i=1}^{N} Z_i'\hat u_i\hat u_i'Z_i,$$

where $\hat u_i = y_i - X_i\hat\beta$ and $\hat\beta$ is an initial consistent estimator such as
two stage least squares (2SLS). The solution to the minimization
leads to the GMM estimator:

$$\hat\beta_{GMM} = [X'Z(V_N)^{-1}Z'X]^{-1}X'Z(V_N)^{-1}Z'y.$$
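The two-step procedure just described can be sketched as follows: a 2SLS first step provides residuals, these give the weighting matrix, and the closed-form GMM estimator is then evaluated. The function name `gmm_two_step`, the (N, T, ·) array layout and the simulated design are our own assumptions.

```python
import numpy as np

def gmm_two_step(y, X, Z):
    """Two-step linear GMM; y is (N, T), X is (N, T, p), Z is (N, T, k)."""
    N = y.shape[0]
    ZX = np.einsum('itk,itp->kp', Z, X)      # Z'X
    Zy = np.einsum('itk,it->k', Z, y)        # Z'y
    ZZ = np.einsum('itk,itl->kl', Z, Z)      # Z'Z

    # Step 1: 2SLS as the initial consistent estimator.
    W1 = np.linalg.inv(ZZ)
    b1 = np.linalg.solve(ZX.T @ W1 @ ZX, ZX.T @ W1 @ Zy)

    # Step 2: V_N = N^{-1} sum_i Z_i' u_i u_i' Z_i from step-1 residuals.
    u = y - X @ b1
    g = np.einsum('itk,it->ik', Z, u)        # row i holds (Z_i' u_i)'
    V_N = g.T @ g / N
    W2 = np.linalg.inv(V_N)
    b2 = np.linalg.solve(ZX.T @ W2 @ ZX, ZX.T @ W2 @ Zy)
    return b1, b2

# Simulated example (hypothetical values): one endogenous regressor.
rng = np.random.default_rng(0)
N, T, k = 2000, 3, 3
Z = rng.normal(size=(N, T, k))
v = rng.normal(size=(N, T))
alpha = rng.normal(size=(N, 1))
eps = 0.5 * v + rng.normal(size=(N, T))      # correlated with x through v
x = Z.sum(axis=2) + v
X = x[:, :, None]
y = 2.0 * x + alpha + eps
b_2sls, b_gmm = gmm_two_step(y, X, Z)
```

Any scale factor applied to the weighting matrix cancels in the closed-form expression, so the sketch does not track the normalization of the criterion function.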


An instrumental variables estimator which is closely related
to this GMM estimator is three stage least squares (3SLS):

$$\hat\beta_{3SLS} = [X'Z(Z'\Omega Z)^{-1}Z'X]^{-1}X'Z(Z'\Omega Z)^{-1}Z'y,$$

where $\Omega = I_N \otimes \Sigma$. For notational convenience, we assume that $\Sigma$ is
known, although, in practice, it should be replaced by a consistent
estimate such as $\hat\Sigma = N^{-1}\sum_{i=1}^{N} \hat u_i\hat u_i'$. In order to understand the
relationship between the GMM and 3SLS estimators, consider the
following condition:

$$E(Z_i'u_iu_i'Z_i) = E(Z_i'\Sigma Z_i).$$ (8-6)

Under this condition, the 3SLS estimator is asymptotically identical
to the GMM estimator, because

$$\mathrm{plim}_{N \to \infty} N^{-1}Z'\Omega Z = \mathrm{plim}_{N \to \infty} N^{-1}\sum_{i=1}^{N} Z_i'\Sigma Z_i = E(Z_i'u_iu_i'Z_i) = V.$$

We will refer to (8-6) as the condition of no conditional
heteroskedasticity (NCH). This is a slight misuse of terminology, since
(8-6) is weaker than the condition that $E(u_iu_i' \mid Z_i) = \Sigma$. However,
(8-6) is what is necessary for the 3SLS estimator to coincide with
the GMM estimator. When (8-6) is violated, $\hat\beta_{GMM}$ is strictly more
efficient than $\hat\beta_{3SLS}$.

An alternative to the 3SLS estimator, which is popularly used
in the panel data literature, is the 2SLS estimator obtained by
premultiplying (8-2) by $\Sigma^{-1/2}$ to filter $u_i$, and then applying the
instruments $Z_i$:

$$\hat\beta_{FIV} = [X'\Omega^{-1/2}Z(Z'Z)^{-1}Z'\Omega^{-1/2}X]^{-1}X'\Omega^{-1/2}Z(Z'Z)^{-1}Z'\Omega^{-1/2}y.$$
We refer to this estimator as the filtered instrumental variables
(FIV) estimator. This estimator is slightly different from the
generalized instrumental variables (GIV) estimator, which was originally
proposed by White [1984]. The GIV estimator is also 2SLS applied
to the filtered model $\Sigma^{-1/2}y_i = \Sigma^{-1/2}X_i\beta + \Sigma^{-1/2}u_i$, but it uses the
filtered instruments $\Sigma^{-1/2}Z_i$. Thus

$$\hat\beta_{GIV} = [X'\Omega^{-1}Z(Z'\Omega^{-1}Z)^{-1}Z'\Omega^{-1}X]^{-1}X'\Omega^{-1}Z(Z'\Omega^{-1}Z)^{-1}Z'\Omega^{-1}y.$$

Despite this difference, the FIV and GIV estimators are often
equivalent in the context of panel data models, especially when $\Sigma$ is of
the random effects structure (8-3). We may also note that the FIV
and GIV estimators would be of little interest without the NCH
assumption.


A motivation for the FIV (or GIV) estimator is that filtering
the error $u_i$ may improve the asymptotic efficiency of instrumental
variables, as GLS improves upon ordinary least squares (OLS).²
However, neither the 3SLS nor the FIV estimator can be shown
to generally dominate the other. This is so because the FIV
estimator is a GMM estimator based on the different moment
condition $E(Z_i'\Sigma^{-1/2}u_i) = 0$ and the different NCH assumption
$E(Z_i'\Sigma^{-1/2}u_iu_i'\Sigma^{-1/2}Z_i) = E(Z_i'Z_i)$.

We now turn to conditions under which the 3SLS and FIV
estimators are numerically equivalent, whenever the same estimate
of $\Sigma$ is used.


THEOREM 8.1 


Suppose that there exists a nonsingular, nonstochastic
matrix B such that $\Sigma^{-1/2}Z_i = Z_iB$ for all i (that is,
$\Omega^{-1/2}Z = ZB$). Then, $\hat\beta_{FIV} = \hat\beta_{3SLS}$.


The proof is omitted because it is straightforward. We note that
the numerical equivalence result of Theorem 8.1 holds only if the
same estimate of $\Sigma$ is used for both $\hat\beta_{FIV}$ and $\hat\beta_{3SLS}$. However,
when two different but consistent estimates of $\Sigma$ are used, the two
estimators remain asymptotically identical.
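The numerical equivalence in Theorem 8.1 can be verified on simulated data. The sketch below uses instruments of Kronecker form, identity times a row vector of individual-specific variables, for which the condition of the theorem holds with B equal to the filter times an identity block. This choice of instruments, the function names and all parameter values are our own assumptions, and the covariance matrix is treated as known.

```python
import numpy as np

def inv_sqrt_sym(S):
    """Symmetric inverse square root via the eigendecomposition."""
    w, P = np.linalg.eigh(S)
    return (P * w ** -0.5) @ P.T

def iv_estimator(ZX, Zy, W):
    """[X'Z W Z'X]^{-1} X'Z W Z'y for given cross products."""
    return np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)

rng = np.random.default_rng(1)
N, T, m = 200, 3, 2
Sigma = 0.5 * np.ones((T, T)) + np.eye(T)      # random effects form
S = inv_sqrt_sym(Sigma)                         # the filter Sigma^{-1/2}

h = rng.normal(size=(N, m))
# Z_i = I_T kron h_i, so that S Z_i = Z_i (S kron I_m).
Z = np.stack([np.kron(np.eye(T), h[i:i + 1]) for i in range(N)])
X = rng.normal(size=(N, T, 1)) + h[:, :1, None]
y = 1.5 * X[:, :, 0] + rng.normal(size=(N, T))

# 3SLS: unfiltered data, weight (Z' Omega Z)^{-1}.
ZX = np.einsum('itk,itp->kp', Z, X)
Zy = np.einsum('itk,it->k', Z, y)
W_3sls = np.linalg.inv(np.einsum('itk,ts,isl->kl', Z, Sigma, Z))
b_3sls = iv_estimator(ZX, Zy, W_3sls)

# FIV: filtered data, unfiltered instruments, weight (Z'Z)^{-1}.
Xf = S @ X
yf = y @ S                                      # S is symmetric
ZXf = np.einsum('itk,itp->kp', Z, Xf)
Zyf = np.einsum('itk,it->k', Z, yf)
W_fiv = np.linalg.inv(np.einsum('itk,itl->kl', Z, Z))
b_fiv = iv_estimator(ZXf, Zyf, W_fiv)
```

The two estimates agree up to rounding error because the filter only rotates the instrument space, which both weighting matrices absorb.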


² White [1984] offers some strong conditions under which the GIV estimator
dominates the 3SLS estimator in terms of asymptotic efficiency.
The main point of Theorem 8.1 is that under certain assumptions,
filtering does not change the efficiency of instrumental
variables or GMM. When Theorem 8.1 holds but the instruments $Z_i$
violate the NCH condition (8-6), both the FIV and 3SLS estimators
are strictly dominated by the GMM estimator applied without
filtering. Clearly, Theorem 8.1 imposes strong restrictions on the
instruments $Z_i$ and the covariance matrix $\Sigma$, which do not generally
hold. Nonetheless, in the context of some panel data models, the
theorem can be used to show that filtering is irrelevant for GMM
or 3SLS exploiting all of the moment conditions. We consider a few
examples below.


8.1.2.1 Strictly Exogenous Instruments and Random Effects


Consider a model in which there exists a $1 \times k$ vector of instruments
$h_{it}$ which are strictly exogenous with respect to the $\varepsilon_{it}$ and
uncorrelated with $\alpha_i$. For this model, we have the moment conditions

$$E(h_{is}'u_{it}) = 0, \qquad s, t = 1, \ldots, T.$$ (8-7)

This is a set of $kT^2$ moment conditions. Denote $h_{it}^o = (h_{i1}, \ldots, h_{it})$
for any $t = 1, \ldots, T$, and set $Z_{SE,i} = I_T \otimes h_{iT}^o$, so that all
of the moment conditions (8-7) can be expressed compactly as
$E(Z_{SE,i}'u_i) = 0$.


We now show that filtering the error $u_i$ does not matter in
GMM. Observe that for any $T \times T$ nonsingular matrix A,

$$AZ_{SE,i} = A(I_T \otimes h_{iT}^o) = A \otimes h_{iT}^o = (I_T \otimes h_{iT}^o)(A \otimes I_{kT}) = Z_{SE,i}B,$$

where $B = A \otimes I_{kT}$. If we replace A by $\Sigma^{-1/2}$, Theorem 8.1 holds.
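The Kronecker product identity used above is easy to confirm numerically. In the sketch below, the dimensions and the random matrices are arbitrary choices of ours; the row vector plays the role of the stacked instruments for one individual.

```python
import numpy as np

rng = np.random.default_rng(2)
T, k = 3, 2

h = rng.normal(size=(1, k * T))                  # 1 x kT row vector
A = rng.normal(size=(T, T)) + T * np.eye(T)      # some T x T matrix

Z = np.kron(np.eye(T), h)                        # T x kT^2 instrument matrix
B = np.kron(A, np.eye(k * T))                    # kT^2 x kT^2

lhs = A @ Z                                      # filtered instruments
rhs = Z @ B                                      # instruments times B
```

Both sides equal the Kronecker product of A with the row vector, by the mixed-product rule.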


8.1.2.2 Strictly Exogenous Instruments and Fixed Effects 


We now allow the instruments $h_{it}$ to be correlated with
$\alpha_i$ (fixed effects), while they are still assumed to be
strictly exogenous with respect to the $\varepsilon_{it}$. For this
case, we may first-difference the model (8-2) to remove $\alpha_i$:

$$L_T' y_i = L_T' X_i \beta + L_T' u_i = L_T' X_i \beta + L_T' \varepsilon_i, \qquad (8-8)$$

where $L_T$ is the $T \times (T-1)$ differencing matrix

$$L_T = \begin{bmatrix}
 1 & 0 & 0 & \cdots & 0 & 0 \\
-1 & 1 & 0 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
 0 & 0 & 0 & \cdots & -1 & 1 \\
 0 & 0 & 0 & \cdots & 0 & -1
\end{bmatrix}.$$


We note that $L_T$ has the same column space as the deviations from
means matrix $Q_T$ (in fact, $Q_T = L_T (L_T' L_T)^{-1} L_T'$). This
reflects the fact that first differences and deviations from means
preserve the same information in the data.
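As a quick check of the relation between the differencing matrix and the deviations-from-means matrix, the sketch below builds both for a small $T$ and compares the projection onto the columns of the differencing matrix with $I_T - e_T e_T'/T$. It assumes NumPy; the helper name `differencing_matrix` is ours.

```python
import numpy as np

def differencing_matrix(T):
    """T x (T-1) matrix whose j-th column has +1 in row j and -1 in row j+1."""
    L = np.zeros((T, T - 1))
    for j in range(T - 1):
        L[j, j] = 1.0
        L[j + 1, j] = -1.0
    return L

T = 5
L_T = differencing_matrix(T)
e_T = np.ones((T, 1))
Q_T = np.eye(T) - e_T @ e_T.T / T                   # deviations-from-means matrix
P_L = L_T @ np.linalg.solve(L_T.T @ L_T, L_T.T)     # projection onto the columns of L_T

print(np.allclose(P_L, Q_T))
```

The two matrices coincide because the columns of the differencing matrix span exactly the vectors orthogonal to $e_T$.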


Note that strict exogeneity of the instruments $h_{it}$ with respect
to the $\varepsilon_{it}$ implies
$E(Z_{SEFE,i}' L_T' u_i) = E(Z_{SEFE,i}' L_T' \varepsilon_i) = 0$, where
$Z_{SEFE,i} = I_{T-1} \otimes h_{iT}^o$. Thus, the model (8-8) can be
estimated by GMM using the instruments $Z_{SEFE,i}$. Once again, given
the NCH condition, filtering does not matter for GMM. To see why,
define $M = [\mathrm{cov}(L_T' u_i)]^{-1/2} = (L_T' \Sigma L_T)^{-1/2}$.
Then, by essentially the same algebra as in the previous section, we
can show that the FIV estimator applying the instruments $Z_{SEFE,i}$
and the filter $M$ to (8-8) is asymptotically equivalent to the 3SLS
estimator applying the same instruments to (8-8).


8.2 Models with Weakly Exogenous Instruments


In this section we consider the GMM estimation of the model (8-2)
with weakly exogenous instruments. Suppose that there exists a
vector of $1 \times k$ instruments $h_{it}$ such that

$$E(h_{it}' u_{is}) = 0, \quad t = 1, \ldots, T, \quad t \le s. \qquad (8-9)$$


There are $T(T+1)k/2$ such moment conditions. These arise if
the instruments $h_{it}$ are weakly exogenous with respect to the
$\varepsilon_{it}$ and uncorrelated with the effect $\alpha_i$. If the
instruments are weakly exogenous with respect to the
$\varepsilon_{it}$ but are correlated with the effects, we have a
smaller number of moment conditions:

$$E(h_{it}' \Delta u_{is}) = E(h_{it}' \Delta \varepsilon_{is}) = 0, \quad t = 1, \ldots, T-1, \quad t < s, \qquad (8-10)$$


where $\Delta u_{is} = u_{is} - u_{i,s-1}$, and similarly for
$\Delta \varepsilon_{is}$. In this section, our discussion will be
focused on GMM based on (8-9) only. We do this because essentially
the same procedures can be used for GMM based on (8-10). The only
difference between (8-9) and (8-10) lies in whether GMM applies to
the original model (8-2) or the differenced model (8-8).


8.2.1 The Forward Filter Estimator 


For the model with weakly exogenous instruments, Keane and Runkle
[1992] propose a forward filter (FF) estimator. To be specific,
define a $T \times T$ upper-triangular matrix $F = [F_{ij}]$
($F_{ij} = 0$ for $i > j$) that satisfies $F \Sigma F' = I_T$, so that
$\mathrm{cov}(F u_i) = I_T$. With this notation, Keane and Runkle
propose a FIV-type estimator, which applies the instruments
$H_i = (h_{i1}', \ldots, h_{iT}')'$ after filtering the error $u_i$
by $F$:

$$\hat\beta_{FF} = [X' F^{*\prime} H (H'H)^{-1} H' F^* X]^{-1} X' F^{*\prime} H (H'H)^{-1} H' F^* y,$$

where $F^* = I_N \otimes F$.
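One way to obtain an upper-triangular filter with the stated property is through the Cholesky factor of $\Sigma^{-1}$. The sketch below is our own illustration, assuming NumPy and a hypothetical covariance matrix (AR(1)-type correlations plus a common component); it is not taken from Keane and Runkle.

```python
import numpy as np

def forward_filter(Sigma):
    """Upper-triangular F with F @ Sigma @ F.T = I.

    np.linalg.cholesky returns a lower-triangular C with C @ C.T = Sigma^{-1};
    its transpose is upper triangular and satisfies the required identity.
    """
    C = np.linalg.cholesky(np.linalg.inv(Sigma))
    return C.T

T = 4
rho = 0.6
lags = np.abs(np.subtract.outer(np.arange(T), np.arange(T)))
Sigma = rho ** lags + 0.5            # hypothetical covariance: AR(1) part plus 0.5 * e e'

F = forward_filter(Sigma)
print(np.allclose(F, np.triu(F)))               # F is upper triangular
print(np.allclose(F @ Sigma @ F.T, np.eye(T)))  # filtered errors have identity covariance
```

Because row $t$ of an upper-triangular $F$ has nonzero entries only in columns $t, t+1, \ldots, T$, the filtered error at time $t$ combines current and future errors only, which is why weak exogeneity of the instruments is preserved.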


The motivation of this FF estimator is that the FIV estimator
using the instruments $H_i$ and the usual filter $\Sigma^{-1/2}$ is
inconsistent unless the instruments $H_i$ are strictly exogenous with
respect to the $\varepsilon_{it}$. In contrast, the forward-filtering
transformation $F$ preserves the weak exogeneity of the instruments
$h_{it}$. Hayashi and Sims [1983] provide some efficiency results for
this forward filtering in the context of time series data. However,
in the context of panel data (with large $N$ and fixed $T$), the FF
estimator does not necessarily dominate (in terms of efficiency) the
GMM or 3SLS estimators using the same instruments $H_i$.


One technical point is worth noting for the FF estimator.
Forward filtering requires that the serial correlations in the error
$u_i$ do not depend on the values of current and lagged values of
the instruments $h_{it}$. (See Hayashi and Sims [1983, pp. 788-789].)
This is a slightly weakened version of the usual condition of no
conditional heteroskedasticity; it is weakened because conditioning
is only on current and lagged values of the instruments. If this
condition does not hold, in general,

$$\operatorname*{plim}_{N\to\infty} N^{-1} \sum_{i=1}^{N} H_i' F u_i u_i' F' H_i \ne \operatorname*{plim}_{N\to\infty} N^{-1} \sum_{i=1}^{N} H_i' F \Sigma F' H_i = \operatorname*{plim}_{N\to\infty} N^{-1} \sum_{i=1}^{N} H_i' H_i,$$

and the rationale for forward filtering is lost. Sufficient conditions
under which the autocorrelations in $u_i$ do not depend on the history
of $h_{it}$ are given by Wooldridge [1996, p. 401].


8.2.2 Irrelevance of Forward Filtering 


The FF and 3SLS estimators using the instruments $H_i$ are
inefficient in that they fail to fully exploit all of the moment
conditions implied by (8-9). As Schmidt, Ahn and Wyhowski [1992]
suggest, a more efficient estimator can be obtained by GMM using the
$T \times T(T+1)k/2$ instruments

$$Z_{WE,i} = \mathrm{diag}(h_{i1}^o, \ldots, h_{iT}^o),$$

where, as before, $h_{it}^o = (h_{i1}, \ldots, h_{it})$. When these
instruments are used in GMM, filtering $u_i$ by $F$ becomes
irrelevant. This result can be seen using Theorem 8.1. Schmidt, Ahn
and Wyhowski show that there exists a $T(T+1)/2 \times T(T+1)/2$
nonsingular, upper-triangular matrix $E$ such that
$F Z_{WE,i} = Z_{WE,i}(E \otimes I_k)$ (Equation (10), p. 11).
Thus, the FF and 3SLS estimators using the instruments $Z_{WE,i}$
are numerically identical (or asymptotically identical if different
estimates of $\Sigma$ are used); filtering does not matter. Of course,
both of these estimators are dominated by the GMM estimator using the
same instruments but an unrestricted weighting matrix, unless the
instruments satisfy the NCH condition (8-6).
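The numerical identity of the forward-filtered and 3SLS estimators under the full block-diagonal instrument set can be illustrated on simulated data. The sketch below rests on assumptions of ours: NumPy, arbitrary random data (the identity is algebraic, so no model needs to hold), a hypothetical covariance matrix, the forward filter taken as the transposed Cholesky factor of $\Sigma^{-1}$, and the same $\Sigma$ used by both estimators. All function and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, k = 60, 3, 1

# Hypothetical error covariance and an upper-triangular filter with F Sigma F' = I.
A = rng.normal(size=(T, T))
Sigma = A @ A.T + T * np.eye(T)
F = np.linalg.cholesky(np.linalg.inv(Sigma)).T

def z_we(h):
    """Block-diagonal instruments for one individual: row t holds (h_1, ..., h_t)."""
    blocks = [h[: t + 1].reshape(1, -1) for t in range(h.shape[0])]
    Z = np.zeros((h.shape[0], sum(b.shape[1] for b in blocks)))
    col = 0
    for t, b in enumerate(blocks):
        Z[t, col:col + b.shape[1]] = b
        col += b.shape[1]
    return Z

H = rng.normal(size=(N, T, k))
Z = np.vstack([z_we(H[i]) for i in range(N)])   # NT x T(T+1)k/2
X = rng.normal(size=(N * T, 2))
y = rng.normal(size=N * T)

F_star = np.kron(np.eye(N), F)
Omega = np.kron(np.eye(N), Sigma)

# Forward-filter estimator: 2SLS on the filtered model with instruments Z.
P_Z = Z @ np.linalg.solve(Z.T @ Z, Z.T)
XF, yF = F_star @ X, F_star @ y
beta_ff = np.linalg.solve(XF.T @ P_Z @ XF, XF.T @ P_Z @ yF)

# 3SLS estimator: unfiltered model, weighting matrix (Z' Omega Z)^{-1}.
W = np.linalg.inv(Z.T @ Omega @ Z)
XZ = X.T @ Z
beta_3sls = np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

print(np.allclose(beta_ff, beta_3sls))
```

With the smaller instrument set $H_i$ in place of the block-diagonal set, the two estimators would in general differ, which is the contrast drawn in the text.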


This irrelevance result does not mean that filtering is
meaningless, even practically. In some cases, the (full) GMM estimator
utilizing all of the instruments $Z_{WE,i}$ may be practically
infeasible. For example, in a model with 10 weakly exogenous
instruments and 10 time periods, the total number of instruments is
550. This can cause computational problems for GMM, especially when
the cross section dimension is small. Furthermore, GMM with a very
large number of moment conditions may have poor finite sample
properties. For example, see Tauchen [1986], Altonji and Segal
[1996] and Andersen and Sørensen [1996] for a discussion of the
finite sample bias of GMM in very overidentified problems. As
a general statement, when the number of moment conditions is 
much larger than the number of parameters, the GMM estimates 
tend to have a substantial finite sample bias, which is usually 
attributed to correlation between the sample moments and the 
sample weighting matrix. Furthermore, in these circumstances the 
asymptotic standard errors tend to understate the true sampling 
variability of the estimates, so that the precision of the estimates is 
overstated. The trade-off between increased informational content 
(lower variance) and larger bias as more moment conditions are 
used could be resolved in terms of mean square error, but this does 
not yield operational advice to the practitioner. In the present 
context, Ziliak [1997] reports that the FF estimator dominates the 
GMM estimator in terms of both bias and mean square error, so 
that the FF estimator may be recommended, if the NCH condition 
holds. If the NCH condition does not hold, it is not clear which 
estimator to recommend. Clearly further research is needed to find 
modified versions of GMM, or reduced sets of moment conditions, 
that lead to estimators with reasonable finite sample properties. 
Some basic principles seem obvious; for example, instruments that
are distant in time from the errors with which they are asserted
to be uncorrelated are unlikely to be very useful or important.
However, the link between these basic principles and constructing 
reasonable operational rules for choosing moment conditions still 
needs to be made. 
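To see how quickly the number of moment conditions grows, the counts for strictly and weakly exogenous instruments can be tabulated by direct enumeration. The short sketch below uses only the Python standard library; the function names are ours.

```python
def count_strict(T, k):
    """Moment conditions under strict exogeneity: all pairs (s, t), s, t = 1..T."""
    return k * sum(1 for s in range(1, T + 1) for t in range(1, T + 1))

def count_weak(T, k):
    """Moment conditions under weak exogeneity: pairs (t, s) with t <= s."""
    return k * sum(1 for t in range(1, T + 1) for s in range(t, T + 1))

for T in (3, 5, 10):
    print(T, count_strict(T, 10), count_weak(T, 10))
```

For 10 instruments and 10 periods the weak-exogeneity count is 550, the figure quoted in the text, and both counts grow quadratically in the number of periods.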


8.2.3 Semiparametric Efficiency Bound 


In GMM, imposing more moment conditions never decreases as- 
ymptotic efficiency. An interesting question is whether there is 
an efficiency bound which GMM estimators cannot improve upon. 
Once this bound is identified, we may be able to construct an 
efficient GMM estimator whose asymptotic covariance matrix co- 
incides with the bound. Chamberlain [1992] considers the semi- 
parametric bound for the model in which the assumption (8-9) is 
replaced by the stronger assumption 


$$E[u_{it} \mid h_{it}^o = (h_{i1}, \ldots, h_{it})] = 0, \quad \text{for } t = 1, \ldots, T. \qquad (8-11)$$




For this case, let $g_{k,it}$ be a $1 \times k$ vector of instruments
which are some (polynomial) functions of $h_{it}^o$. Define
$G_{k,i} = \mathrm{diag}(g_{k,i1}, \ldots, g_{k,iT})$, so that under
(8-11), $E(G_{k,i}' u_i) = 0$. Under some suitable regularity
conditions, Chamberlain shows that the semiparametric efficiency
bound for GMM based on (8-11) is $B_o^{-1}$, where

$$B_o = \lim_{k \to \infty} E(X_i' G_{k,i}) [E(G_{k,i}' u_i u_i' G_{k,i})]^{-1} E(G_{k,i}' X_i).$$


This bound naturally suggests the GMM estimator based on the
moment condition $E(G_{k,i}' u_i) = 0$ with large $k$. However, when
$k$ grows with $N$ without any restriction, the usual asymptotic
GMM inferences obtained by treating $k$ as fixed would be misleading.
In response to this problem, Hahn [1997] rigorously examines
conditions under which the usual GMM inferences are valid for
large $k$ and the GMM estimator based on the moment condition
$E(G_{k,i}' u_i) = 0$ is efficient. Under the assumption that

$$\lim_{N \to \infty} \frac{k^4}{N} = 0, \qquad (8-12)$$


Hahn establishes the asymptotic efficiency of the GMM estimator. 
A similar result is obtained by Koenker and Machado [1996] for a 
linear model with heteroskedasticity of general form. They show 
that the usual GMM inferences, treating the number of moment 
conditions as fixed, are asymptotically valid if the number of moment
conditions grows more slowly than $N^{1/3}$. These results do not
directly indicate how to choose the number of moment conditions 
to use, for a given finite value of N, but they do provide grounds for 
suspicion about the desirability of GMM using numbers of moment 
conditions that are very large. 


8.3 Models with Strictly Exogenous Regressors


This section considers efficient GMM estimation of linear panel data
models with strictly exogenous regressors. The model of interest in
this section is the standard panel data regression model

$$y_i = R_i \xi + (e_T \otimes w_i)\gamma + u_i = X_i \beta + u_i; \quad u_i = (e_T \otimes \alpha_i) + \varepsilon_i, \qquad (8-13)$$

where $R_i = [r_{i1}', \ldots, r_{iT}']'$ is a $T \times k$ matrix of
time varying regressors, $e_T \otimes w_i = [w_i', \ldots, w_i']'$ is a
$T \times g$ matrix of time invariant regressors, and $\xi$ and
$\gamma$ are $k \times 1$ and $g \times 1$ vectors of unknown
parameters, respectively. We assume that the regressors $r_{it}$ and
$w_i$ are strictly exogenous to the $\varepsilon_{it}$; that is,

$$E(d_i \otimes \varepsilon_i) = 0, \qquad (8-14)$$

where $d_i = (r_{i1}, \ldots, r_{iT}, w_i)$. We also assume the random
effects covariance structure
$\Sigma = \sigma_\alpha^2 e_T e_T' + \sigma_\varepsilon^2 I_T$ as given
in equation (8-3). For notational convenience, we treat
$\sigma_\alpha^2$ and $\sigma_\varepsilon^2$ as known. See Hausman and
Taylor [1981] for consistent estimation of these variances.


Efficient estimation of the model (8-13) depends crucially on
the assumptions about correlations between the regressors $d_i$ and
the effects $\alpha_i$. When the regressors are uncorrelated with
$\alpha_i$, the traditional GLS estimator is consistent and efficient.
If the regressors are suspected to be correlated with the effect, the
within estimator can be used. However, a serious drawback of the
within estimator is that it cannot identify $\gamma$ because the
within transformation wipes out the time invariant regressors $w_i$
as well as the individual effects $\alpha_i$.


In response to this problem, Hausman and Taylor [1981] considered
the alternative assumption that some but possibly not all
of the explanatory variables are uncorrelated with the effect $\alpha_i$.
This offers a middle ground between the traditional random effects 
and fixed effects approaches. Extending their study, Amemiya 
and MaCurdy [1986] and Breusch, Mizon and Schmidt [1989] con- 
sidered stronger assumptions and derived alternative instrumental 
variables estimators that are more efficient than the Hausman and 
Taylor estimator. A systematic treatment of these estimators can 
be found in Mátyás and Sevestre [1995, Ch.6]. In what follows, we 
will study these estimators and the conditions under which they 
are efficient GMM estimators. 


8.3.1 The Hausman and Taylor, Amemiya and MaCurdy and Breusch, Mizon and Schmidt Estimators


Following Hausman and Taylor, we decompose $r_{it}$ and $w_i$ into
$r_{it} = (r_{1it}, r_{2it})$ and $w_i = (w_{1i}, w_{2i})$, where
$r_{1it}$ and $r_{2it}$ are $1 \times k_1$ and $1 \times k_2$,
respectively, and $w_{1i}$ and $w_{2i}$ are $1 \times g_1$ and
$1 \times g_2$. With this notation, define:

$$s_{HT,i} = (\bar r_{1i}, w_{1i}); \quad s_{AM,i} = (r_{1i1}, \ldots, r_{1iT}, w_{1i}); \quad s_{BMS,i} = (s_{AM,i}, \tilde r_{2i}), \qquad (8-15)$$

where $\tilde r_{2i} = (r_{2i1} - \bar r_{2i}, \ldots, r_{2i,T-1} - \bar r_{2i})$.
Hausman and Taylor, Amemiya and MaCurdy and Breusch, Mizon and
Schmidt impose the following assumptions, respectively, on the model
(8-13):

$$E(s_{HT,i}' \alpha_i) = 0; \quad E(s_{AM,i}' \alpha_i) = 0; \quad E(s_{BMS,i}' \alpha_i) = 0. \qquad (8-16)$$


These assumptions are sequentially stronger. The Hausman and
Taylor assumption $E(s_{HT,i}' \alpha_i) = 0$ is weaker than the
Amemiya and MaCurdy assumption $E(s_{AM,i}' \alpha_i) = 0$, since it
only requires the individual means of $r_{1it}$ to be uncorrelated
with the effect, rather than requiring $r_{1it}$ to be uncorrelated
with $\alpha_i$ for each $t$. (However, as Amemiya and MaCurdy argue,
it is hard to think of cases in which each of the variables $r_{1it}$
is correlated with $\alpha_i$ while their individual means are not.)
Imposing the Amemiya and MaCurdy assumption instead of the Hausman
and Taylor assumption in GMM would generally lead to a more efficient
estimator. The Breusch, Mizon and Schmidt assumption
$E(s_{BMS,i}' \alpha_i) = 0$ is based on the stationarity condition,

$$E(r_{2it}' \alpha_i) \text{ is the same for } t = 1, \ldots, T. \qquad (8-17)$$


This means that, even though the unobserved effect $\alpha_i$ is
allowed to be correlated with $r_{2it}$, the covariance does not
change with time.
Cornwell and Rupert [1988] provide some evidence for the empirical 
legitimacy of the Breusch, Mizon and Schmidt assumption. They 
also report that GMM imposing the Breusch, Mizon and Schmidt 
assumption rather than the Hausman and Taylor or Amemiya and 
MaCurdy assumptions would result in significant efficiency gains. 


Hausman and Taylor, Amemiya and MaCurdy and Breusch, 
Mizon and Schmidt consider FIV estimation of the model (8-13) 
under the random effects assumption (8-3). The instruments used 
by Hausman and Taylor, Amemiya and MaCurdy and Breusch, 
Mizon and Schmidt are of the common form 


$$Z_{A,i} = (Q_T R_i, \; e_T \otimes s_i), \qquad (8-18)$$


where the form of $s_i$ varies across authors; that is,
$s_i = s_{HT,i}$, $s_{AM,i}$ or $s_{BMS,i}$.


Consistency of the FIV estimator using the instruments $Z_{A,i}$
requires $E(Z_{A,i}' \Sigma^{-1/2} u_i) = 0$. This condition can be
easily justified under (8-3), (8-14) and (8-16). Without loss of
generality, we set $\sigma_\varepsilon^2 = 1$. Then, it can be easily
shown that $\Sigma^{-1} = \theta^2 P_T + Q_T$ and
$\Sigma^{-1/2} = \theta P_T + Q_T$, where
$\theta^2 = \sigma_\varepsilon^2 / (\sigma_\varepsilon^2 + T \sigma_\alpha^2)$.
With this result, strict exogeneity of the regressors $r_{it}$ and
$w_i$ (8-14) implies

$$E(R_i' Q_T \Sigma^{-1/2} u_i) = E[R_i' Q_T (\theta P_T + Q_T) u_i] = E(R_i' Q_T \varepsilon_i) = 0.$$

In addition, both (8-14) and (8-16) imply

$$E[(e_T \otimes s_i)' \Sigma^{-1/2} u_i] = E[\theta (e_T \otimes s_i)' (e_T \otimes \alpha_i)] = T \theta \times E(s_i' \alpha_i) = 0.$$
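The closed forms for the inverse and the inverse square root of the random effects covariance matrix can be verified numerically. The following sketch assumes NumPy, normalizes the idiosyncratic variance to one as in the text, and uses a hypothetical value for the variance of the effect.

```python
import numpy as np

T = 5
sigma_alpha2 = 0.8                        # hypothetical variance of the individual effect
e_T = np.ones((T, 1))
P_T = e_T @ e_T.T / T                     # projection onto e_T (individual means)
Q_T = np.eye(T) - P_T                     # deviations from individual means

Sigma = sigma_alpha2 * (e_T @ e_T.T) + np.eye(T)   # random effects structure, sigma_eps^2 = 1
theta = np.sqrt(1.0 / (1.0 + T * sigma_alpha2))

# Symmetric inverse square root of Sigma via its eigen-decomposition.
vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv_half = vecs @ np.diag(vals ** -0.5) @ vecs.T

print(np.allclose(np.linalg.inv(Sigma), theta**2 * P_T + Q_T))
print(np.allclose(Sigma_inv_half, theta * P_T + Q_T))
```

Since $P_T$ and $Q_T$ are orthogonal projections that sum to the identity, any power of the covariance matrix is obtained by applying that power to the two coefficients separately.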


Several properties of the FIV estimator are worth noting. First,
the usual GMM identification condition requires that the number
of columns in $Z_{A,i}$ should not be smaller than the number of
parameters in $\beta = (\xi', \gamma')'$. This condition is satisfied
if the number of variables in $s_i$ is not less than the number of
time invariant regressors (e.g., $k_1 \ge g_2$ for the Hausman and
Taylor case). Second, the FIV estimator is an intermediate case
between the traditional GLS and within estimators. It can be shown
that the FIV estimator of $\xi$ equals the within estimator if the
model is exactly identified (e.g., $k_1 = g_2$ for Hausman and
Taylor), while it is strictly more efficient if the model is
overidentified (e.g., $k_1 > g_2$ for Hausman and Taylor). The FIV
estimator of $\beta$ is equivalent to the GLS estimator if
$k_2 = g_2 = 0$ (that is, no regressor is correlated with
$\alpha_i$). For more details, see Hausman and Taylor.


Finally, the FIV estimator is numerically equivalent to the
3SLS estimator applying the instruments $Z_{A,i}$ (if the same
estimate of $\Sigma$ is used); thus, filtering does not matter. To
see this, observe that

$$\Sigma^{-1/2} Z_{A,i} = (\theta P_T + Q_T)(Q_T R_i, \; e_T \otimes s_i) = (Q_T R_i, \; \theta e_T \otimes s_i) = (Q_T R_i, \; e_T \otimes s_i)\,\mathrm{diag}(I_{(1)}, \theta I_{(2)}) = Z_{A,i}\,\mathrm{diag}(I_{(1)}, \theta I_{(2)}),$$

where $I_{(1)}$ and $I_{(2)}$ are conformable identity matrices. Since
the matrix $\mathrm{diag}(I_{(1)}, \theta I_{(2)})$ is nonsingular,
Theorem 8.1 applies: $\hat\beta_{FIV} = \hat\beta_{3SLS}$. This result
also implies that the FIV estimator is equally efficient as the GMM
estimator using the instruments $Z_{A,i}$, if the instruments satisfy
the NCH condition (8-6). When this NCH condition is violated, the GMM
estimator using the instruments $Z_{A,i}$ and an unrestricted
weighting matrix is strictly more efficient than the FIV estimator.


8.3.2 Efficient GMM Estimation 


We now consider alternative GMM estimators which are potentially
more efficient than the Hausman and Taylor, Amemiya and
MaCurdy or Breusch, Mizon and Schmidt estimators. To begin with,
observe that strict exogeneity of the regressors $r_{it}$ and
$w_i$ (8-14) implies many more moment conditions than the Hausman
and Taylor, Amemiya and MaCurdy or Breusch, Mizon and
Schmidt estimators utilize. The strict exogeneity condition (8-14)
implies

$$E[(L_T \otimes d_i)' u_i] = E(L_T' u_i \otimes d_i') = E[L_T'(e_T \alpha_i + \varepsilon_i) \otimes d_i'] = E(L_T' \varepsilon_i \otimes d_i') = 0,$$

where $L_T \otimes d_i$ is $T \times \{(T-1)(kT+g)\}$. Based on this
observation, Arellano and Bover [1995] (and Ahn and Schmidt [1995])
propose the GMM estimator using the instruments

$$Z_{B,i} = (L_T \otimes d_i, \; e_T \otimes s_i), \qquad (8-19)$$

which include $(T-1)(Tk+g) - k$ more instruments than $Z_{A,i}$. The
covariance matrix need not be restricted. Clearly, the instruments
$Z_{B,i}$ subsume $Z_{A,i} = (Q_T R_i, e_T \otimes s_i)$, which are
essentially the Hausman and Taylor, Amemiya and MaCurdy or Breusch,
Mizon and Schmidt instruments. Thus, the GMM estimator utilizing all
of the instruments $Z_{B,i}$ cannot be less efficient than the GMM
estimator using the smaller set of instruments $Z_{A,i}$. In terms of
achieving asymptotic efficiency, there is no reason to prefer to use
the fewer instruments $Z_{A,i}$.
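The difference in size between the two instrument sets can be tabulated with a few lines of code. The sketch below uses only the Python standard library; the function names and the argument `n_s` (the number of variables in the vector of instruments uncorrelated with the effect) are ours, and the parameter values are illustrative.

```python
def n_cols_small(k, n_s):
    """Columns of the smaller instrument set: k within-transformed regressors plus n_s."""
    return k + n_s

def n_cols_full(T, k, g, n_s):
    """Columns of the full instrument set: (T-1)(kT+g) differenced products plus n_s."""
    return (T - 1) * (k * T + g) + n_s

k, g, n_s = 5, 5, 5
for T in (3, 5, 10):
    extra = n_cols_full(T, k, g, n_s) - n_cols_small(k, n_s)
    print(T, n_cols_small(k, n_s), n_cols_full(T, k, g, n_s), extra)
```

The difference does not depend on `n_s`, and with five time varying regressors, five time invariant regressors and ten periods it equals 490.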


However, using all of the instruments $Z_{B,i}$ may not be
practically feasible, even when $T$ is only moderately large. For
example, consider the case in which $k = g = 5$ and $T = 10$. For
this case, the number of instruments in $Z_{B,i}$ exceeds the number
of instruments in $Z_{A,i}$ by 490 (= 495 - 5). For such cases,
the GMM estimator using the Hausman and Taylor, Amemiya and
MaCurdy or Breusch, Mizon and Schmidt instruments would be of
more practical use.


In addition, the GMM (or FIV) estimator using the instruments
$Z_{A,i}$ can be shown to be asymptotically as efficient as the GMM
estimator using all of the instruments $Z_{B,i}$, under specific
assumptions that are consistent with the motivation for the Hausman
and Taylor, Amemiya and MaCurdy or Breusch, Mizon and Schmidt
estimators. Arellano and Bover [1995] provide the foundation for
this result.


THEOREM 8.2 


Suppose that $\Sigma$ has the random effects structure (8-3).
Then, the 3SLS estimator using the instruments $Z_{B,i}$ is
numerically identical to the 3SLS estimator using the smaller
set of instruments $Z_{A,i}$, if the same estimate of $\Sigma$ is
used.$^3$


Although Arellano and Bover [1995] provide a detailed proof of the
theorem, we provide a shorter alternative proof. In what follows, we
use the usual projection notation: for any matrix $B$ of full column
rank, we define the projection matrix $P(B) = B(B'B)^{-1}B'$. The
following lemma is useful for the proof of Theorem 8.2.


Lemma 8.1


Let $L_* = I_N \otimes L_T$ and
$D = [(I_{T-1} \otimes d_1)', \ldots, (I_{T-1} \otimes d_N)']'$.
Define $V = I_N \otimes e_T$ and $W = (w_1', \ldots, w_N')'$, so that
$X = (R, VW)$. Then, $P(L_* D) X = P(Q_V R) X$.


PROOF OF THE LEMMA

Since

$$Q_V = P(L_*) = L_* (L_*' L_*)^{-1} L_*',$$

we have

$$Q_V R = L_* [(L_*' L_*)^{-1} L_*' R].$$


In addition, since $D$ spans all of the columns in
$(L_*' L_*)^{-1} L_*' R$, $L_* D$ must span

$$Q_V R = L_* [(L_*' L_*)^{-1} L_*' R].$$


3 Even if different estimates of $\Sigma$ are used, these 3SLS
estimators are still asymptotically identical.

Finally, since $Q_V L_* = L_*$ and $Q_V V = 0$,

$$P(L_* D) X = P(L_* D) Q_V X = (P(L_* D) Q_V R, \; 0) = (Q_V R, \; 0) = P(Q_V R) X.$$
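Lemma 8.1 is an exact algebraic statement, so it can be checked on arbitrary data. The sketch below is our own illustration, assuming NumPy and randomly generated regressors; the projection is computed with a pseudo-inverse, and all names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, k, g = 40, 3, 2, 1

def proj(B):
    """Orthogonal projection onto the column space of B."""
    return B @ np.linalg.pinv(B)

# Differencing matrix: column j has +1 in row j and -1 in row j+1.
L_T = np.eye(T)[:, :T - 1] - np.eye(T, k=-1)[:, :T - 1]

r = rng.normal(size=(N, T, k))          # time varying regressors
w = rng.normal(size=(N, g))             # time invariant regressors

R = r.reshape(N * T, k)
V = np.kron(np.eye(N), np.ones((T, 1)))
X = np.hstack([R, V @ w])               # X = (R, VW)
Q_V = np.eye(N * T) - proj(V)

# Stack L_T (x) d_i over individuals, with d_i = (r_i1, ..., r_iT, w_i).
LD = np.vstack([
    np.kron(L_T, np.concatenate([r[i].ravel(), w[i]])[None, :])
    for i in range(N)
])

lhs = proj(LD) @ X
rhs = proj(Q_V @ R) @ X
print(np.allclose(lhs, rhs))
```

Both sides equal the matrix whose first block is the within-transformed time varying regressors and whose second block is zero, as in the last line of the proof.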


We are now ready to prove Theorem 8.2. 


PROOF OF THE THEOREM 


Note that $Z_A = [Q_V R, VS]$ and $Z_B = [L_* D, VS]$, where
$S = (s_1', \ldots, s_N')'$. Since $L_*$ and $Q_V$ are in the same
space and orthogonal to both $V$ and $P_V = P(V)$, we
have $P(Z_A) = P(Q_V R) + P(VS)$ and
$P(Z_B) = P(L_* D) + P(VS)$. Using these results and the fact that
$\Omega^{-1/2} = \theta P_V + Q_V$, we can also show that the 3SLS
estimators using the instruments $Z_{A,i}$ and $Z_{B,i}$,
respectively, equal

$$\hat\beta_A = [X'\{\theta^{-2} P(Q_V R) + P(VS)\}X]^{-1} X'\{\theta^{-2} P(Q_V R) + P(VS)\}y;$$

$$\hat\beta_B = [X'\{\theta^{-2} P(L_* D) + P(VS)\}X]^{-1} X'\{\theta^{-2} P(L_* D) + P(VS)\}y.$$

However, Lemma 8.1 implies
$\{\theta^{-2} P(L_* D) + P(VS)\}X = \{\theta^{-2} P(Q_V R) + P(VS)\}X$.
Thus, $\hat\beta_B = \hat\beta_A$.
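The numerical identity asserted by Theorem 8.2 can also be illustrated directly. The sketch below is a hedged illustration of ours, assuming NumPy, arbitrary random data, a hypothetical variance for the individual effect, and individual means of the time varying regressors as the instruments uncorrelated with the effect; both estimators use the same covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, k, g = 30, 3, 2, 1

L_T = np.eye(T)[:, :T - 1] - np.eye(T, k=-1)[:, :T - 1]   # differencing matrix
e_T = np.ones((T, 1))
Q_T = np.eye(T) - e_T @ e_T.T / T

r = rng.normal(size=(N, T, k))
w = rng.normal(size=(N, g))
y = rng.normal(size=N * T)              # any y works: the identity is algebraic

sigma_alpha2 = 0.7                      # hypothetical; idiosyncratic variance set to 1
Sigma = sigma_alpha2 * (e_T @ e_T.T) + np.eye(T)
Omega = np.kron(np.eye(N), Sigma)

X_rows, ZA_rows, ZB_rows = [], [], []
for i in range(N):
    R_i = r[i]                                           # T x k
    d_i = np.concatenate([R_i.ravel(), w[i]])[None, :]   # 1 x (kT + g)
    s_i = R_i.mean(axis=0, keepdims=True)                # illustrative choice of s_i
    X_rows.append(np.hstack([R_i, np.kron(e_T, w[i][None, :])]))
    ZA_rows.append(np.hstack([Q_T @ R_i, np.kron(e_T, s_i)]))
    ZB_rows.append(np.hstack([np.kron(L_T, d_i), np.kron(e_T, s_i)]))
X, Z_A, Z_B = map(np.vstack, (X_rows, ZA_rows, ZB_rows))

def three_sls(X, y, Z, Omega):
    """3SLS: [X'Z (Z' Omega Z)^{-1} Z'X]^{-1} X'Z (Z' Omega Z)^{-1} Z'y."""
    W = np.linalg.inv(Z.T @ Omega @ Z)
    XZ = X.T @ Z
    return np.linalg.solve(XZ @ W @ XZ.T, XZ @ W @ (Z.T @ y))

beta_A = three_sls(X, y, Z_A, Omega)
beta_B = three_sls(X, y, Z_B, Omega)
print(np.allclose(beta_A, beta_B))
```

Replacing the random effects covariance by an unrestricted positive definite matrix in this sketch would generally break the equality, in line with the role the random effects structure plays in the theorem.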


Theorem 8.2 effectively offers conditions under which the 3SLS (or
FIV) estimator using the instruments $Z_{A,i}$ is an efficient GMM
estimator. Under the random effects structure (8-3), the 3SLS
estimator using the instruments $Z_{A,i}$ equals the 3SLS estimator
using the full set of instruments $Z_{B,i}$. Thus if, in addition to
(8-3), the instruments $Z_{B,i}$ satisfy the NCH assumption (8-6), the
3SLS (or FIV) estimator using the instruments $Z_{A,i}$ should be
asymptotically equivalent to the efficient GMM estimator exploiting
all of the moment conditions $E(Z_{B,i}' u_i) = 0$. Note that both
assumptions (8-3) and (8-6) are crucial for this efficiency result.
If one of these assumptions is violated, the GMM estimator exploiting
all of the instruments $Z_{B,i}$ is strictly more efficient than the
GMM estimator using the instruments $Z_{A,i}$.


8.3.3 GMM with Unrestricted $\Sigma$


Im, Ahn, Schmidt and Wooldridge [1996] examine efficient GMM
estimation for the case in which the instruments $Z_{B,i}$ satisfy
the NCH condition (8-6), but $\Sigma$ is unrestricted. For this case,
Im, Ahn, Schmidt and Wooldridge consider the 3SLS estimator using the
instruments $\Sigma^{-1} Z_{A,i} = \Sigma^{-1}(Q_T R_i, e_T \otimes s_i)$,
which is essentially the GIV estimator of White [1984]. They show that
when $s_i = s_{BMS,i}$, the 3SLS estimator using the instruments
$\Sigma^{-1} Z_{A,i}$ is numerically equivalent to the 3SLS estimator
using all of the instruments
$Z_{B,i} = (L_T \otimes d_i, e_T \otimes s_i)$. However, they also find
that this equality does not hold when $s_i = s_{HT,i}$ or $s_{AM,i}$.
In fact, without the BMS assumption, the set of instruments
$\Sigma^{-1} Z_{A,i}$ is not legitimate in 3SLS. This is true even if
$s_i = s_{HT,i}$ or $s_{AM,i}$. To see this, observe that

$$E(R_i' Q_T \Sigma^{-1} u_i) = E(R_i' Q_T \Sigma^{-1} e_T \alpha_i) + E(R_i' Q_T \Sigma^{-1} \varepsilon_i) = E(R_i' Q_T \Sigma^{-1} e_T \alpha_i),$$

where the last equality results from strict exogeneity of $R_i$
with respect to $\varepsilon_i$. However, with unrestricted $\Sigma$
and without the BMS assumption,
$E(R_i' Q_T \Sigma^{-1} e_T \alpha_i) \ne 0$, and $\Sigma^{-1} Q_T R_i$
is not legitimate.


Im, Ahn, Schmidt and Wooldridge provide a simple solution
to this problem, which is to replace $Q_T$ by a different matrix that
removes the effects,
$Q_\Sigma = \Sigma^{-1} - \Sigma^{-1} e_T (e_T' \Sigma^{-1} e_T)^{-1} e_T' \Sigma^{-1}$.
Clearly $Q_\Sigma e_T = 0$. Thus $R_i' Q_\Sigma e_T \alpha_i = 0$, and
$E(R_i' Q_\Sigma u_i) = E(R_i' Q_\Sigma \varepsilon_i) = 0$ given strict
exogeneity of $R_i$ with respect to $\varepsilon_i$. Thus,
$Q_\Sigma R_i$ are legitimate instruments for 3SLS.
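The key property of the modified matrix, that it annihilates the vector of ones for any positive definite covariance matrix, is easy to check numerically. The sketch below assumes NumPy and hypothetical covariance matrices; the function name `q_sigma` is ours. It also shows that under the random effects structure with unit idiosyncratic variance the modified matrix reduces to the usual deviations-from-means matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 4
e_T = np.ones((T, 1))
Q_T = np.eye(T) - e_T @ e_T.T / T

def q_sigma(Sigma):
    """Sigma^{-1} - Sigma^{-1} e (e' Sigma^{-1} e)^{-1} e' Sigma^{-1}."""
    S_inv = np.linalg.inv(Sigma)
    return S_inv - S_inv @ e_T @ np.linalg.inv(e_T.T @ S_inv @ e_T) @ e_T.T @ S_inv

A = rng.normal(size=(T, T))
Sigma_u = A @ A.T + np.eye(T)                      # hypothetical unrestricted covariance
Sigma_re = 0.9 * (e_T @ e_T.T) + np.eye(T)         # random effects structure, sigma_eps^2 = 1

Q_S = q_sigma(Sigma_u)
leak = Q_T @ np.linalg.inv(Sigma_u) @ e_T          # what the individual effect is multiplied by

print(np.allclose(Q_S @ e_T, 0.0))                 # the modified matrix removes the effect
print(np.linalg.norm(leak) > 1e-6)                 # Q_T Sigma^{-1} generally does not
print(np.allclose(q_sigma(Sigma_re), Q_T))         # reduces to Q_T under random effects
```

The nonzero `leak` vector is the reason the within-transformed regressors combined with the inverse covariance matrix fail to remove the effect when the covariance matrix is unrestricted.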


This discussion motivates modified instruments of the form
$(Q_\Sigma R_i, \Sigma^{-1} e_T \otimes s_i)$. Im, Ahn, Schmidt and
Wooldridge show that the 3SLS estimator using these modified
instruments is numerically equivalent to the 3SLS estimator using
all of the instruments $(L_T \otimes d_i, e_T \otimes s_i)$, if the
same estimate of $\Sigma$ is used. That is, the modified 3SLS
estimator is an efficient GMM estimator, if the instruments
$(L_T \otimes d_i, e_T \otimes s_i)$ satisfy the NCH condition (8-6).


8.4 Simultaneous Equations


In this section we consider GMM estimation of a simultaneous
equations model, with panel data and unobservable individual effects
in each structural equation. The foundation of this section is the
model considered by Cornwell, Schmidt and Wyhowski [1992] and
Mátyás and Sevestre [1995, Ch. 9]:

$$y_{j,i} = Y_{j,i} \delta_j + R_{j,i} \xi_j + (e_T \otimes w_{j,i}) \gamma_j + u_{j,i} = X_{j,i} \beta_j + u_{j,i}; \quad u_{j,i} = e_T \otimes \alpha_{j,i} + \varepsilon_{j,i}. \qquad (8-20)$$

Here $j = 1, \ldots, J$ indexes the individual structural equation, so
that equation (8-20) reflects $T$ observations for individual $i$ and
equation $j$. $Y_{j,i}$ denotes the data matrix of included endogenous
variables. Other variables are defined similarly to those in (8-13).
We denote $\Sigma_{jh} = E(u_{j,i} u_{h,i}')$ for $j, h = 1, \ldots, J$.


In order to use the same notation for instrumental variables as
in Section 8.3, we let $R_i = (r_{i1}', \ldots, r_{iT}')'$ and $w_i$ be
the $T \times k$ and $1 \times g$ data matrices of all time varying and
time invariant exogenous regressors in the system, respectively. With
this notation, we can define $d_i$, $s_i$, $Z_{A,i}$ and $Z_{B,i}$ as
in Section 8.3. Consistent with Cornwell, Schmidt and Wyhowski, we
assume that the variables $d_i = (r_{i1}, \ldots, r_{iT}, w_i)$ are
strictly exogenous to the $\varepsilon_{j,it}$; that is,
$E(d_i \otimes \varepsilon_{j,i}) = 0$ for all $j = 1, \ldots, J$. We
also assume that a subset $s_i$ of the exogenous variables $d_i$ is
uncorrelated with the individual effects $\alpha_{j,i}$
($j = 1, \ldots, J$). As in Section 8.3, an appropriate choice
of $s_i$ can be made by imposing the Hausman and Taylor [1981],
Amemiya and MaCurdy [1986], or Breusch, Mizon and Schmidt
[1989] assumptions on $d_i$.


Under these assumptions, Cornwell, Schmidt and Wyhowski
consider GMM estimators based on the moment conditions

$$E(Z_{A,i}' u_{j,i}) = 0, \quad \text{for } j = 1, \ldots, J, \qquad (8-21)$$

where the instruments $Z_{A,i} = (Q_T R_i, e_T \otimes s_i)$ are of the
Hausman and Taylor [1981], Amemiya and MaCurdy [1986], or Breusch,
Mizon and Schmidt [1989] forms as in (8-18). Clearly, the model
(8-20) implies more moment conditions than those in (8-21). In
the same way as in Section 8.3.2, we can show that the full set of
moment conditions implied by the model (8-20) is

$$E(Z_{B,i}' u_{j,i}) = 0, \quad \text{for } j = 1, \ldots, J,$$

where $Z_{B,i} = (L_T \otimes d_i, e_T \otimes s_i)$. We will derive
conditions under which the GMM (3SLS) estimator based on (8-21) is
asymptotically as efficient as the GMM estimator exploiting the full
set of moment conditions.


In (8-21), we implicitly assume that the same instruments
$Z_{A,i}$ are available for each structural equation. This assumption
is purely for notational convenience. We can easily allow the
instruments $Z_{A,i}$ to vary over different equations, at the cost
of more complex matrix notation (see Cornwell, Schmidt and Wyhowski
[1992], Section 3.4).


8.4.1 Estimation of a Single Equation 


We now consider GMM estimation of a particular structural equation
in the system (8-20), say the first equation, which we write
as

$$y_{1,i} = X_{1,i} \beta_1 + u_{1,i}, \qquad (8-22)$$


adopting the notation of (8-20). Using our convention of matrix
notation, we can also write this model for all $NT$ observations as
$y_1 = X_1 \beta_1 + u_1$, where $X_1 = (Y_1, R_1, V W_1)$.


GMM estimation of the model (8-22) is straightforward. The
parameter $\beta_1$ can be consistently estimated by essentially the
same GMM or instrumental variables methods as in Section 8.3. Given
the assumption (8-21), the GMM estimator using the instruments
$Z_{A,i}$ is consistent.


We now consider conditions under which the GMM estimator 
based on (8-21) is fully efficient (as efficient as GMM based on the 
full set of moment conditions). The following Lemma provides a 
clue. 


Lemma 8.2 


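$$(L_T \otimes d_i)[E(L_T' L_T \otimes d_i' d_i)]^{-1} E[(L_T \otimes d_i)' Y_{1,i}] = Q_T R_i [E(R_i' Q_T R_i)]^{-1} E(R_i' Q_T Y_{1,i}).$$


PROOF OF THE LEMMA


Consider the reduced form equations for the endogenous
regressors $Y_{1,i}$:

$$Y_{1,i} = R_i \Pi_{11} + (e_T \otimes w_i) \Pi_{12} + (e_T \otimes \alpha_i) \Pi_{13} + v_{1,i},$$

where $\alpha_i = (\alpha_{1,i}, \ldots, \alpha_{J,i})$, and the error
$v_{1,i}$ is a linear function of the structural random errors
$\varepsilon_{j,i}$ ($j = 1, \ldots, J$). Since the variables
$d_i = (r_{i1}, \ldots, r_{iT}, w_i)$ are strictly exogenous with
respect to the $\varepsilon_{j,i}$, so are they with respect to
$v_{1,i}$; that is, $E(d_i \otimes v_{1,i}) = 0$. Note also that since
$L_* D$ spans the columns of $Q_V R$ (see the proof of Lemma 8.1),
there exists a conformable matrix $A$ such that
$Q_T R_i = (L_T \otimes d_i) A$ for all $i$. These results imply

$$\begin{aligned}
(L_T \otimes d_i)&[E(L_T' L_T \otimes d_i' d_i)]^{-1} E[(L_T \otimes d_i)' Y_{1,i}] \\
&= (L_T \otimes d_i)[E(L_T' L_T \otimes d_i' d_i)]^{-1} E[(L_T \otimes d_i)' Q_T Y_{1,i}] \\
&= (L_T \otimes d_i)[E(L_T' L_T \otimes d_i' d_i)]^{-1} E[(L_T \otimes d_i)' Q_T R_i \Pi_{11}] \\
&= (L_T \otimes d_i)[E(L_T' L_T \otimes d_i' d_i)]^{-1} E[(L_T \otimes d_i)' (L_T \otimes d_i) A \Pi_{11}] \\
&= (L_T \otimes d_i) A \Pi_{11} = Q_T R_i \Pi_{11}.
\end{aligned}$$

However, we also have

$$Q_T R_i [E(R_i' Q_T R_i)]^{-1} E(R_i' Q_T Y_{1,i}) = Q_T R_i [E(R_i' Q_T R_i)]^{-1} E(R_i' Q_T R_i \Pi_{11}) = Q_T R_i \Pi_{11}.$$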


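What Lemma 8.2 means is that

$$\mathrm{Proj}(Y_{1,i} \mid L_T \otimes d_i) = \mathrm{Proj}(Y_{1,i} \mid Q_T R_i),$$

where $\mathrm{Proj}(B_i \mid C_i)$ is the population least squares
projection of $B_i$ on $C_i$. Thus, Lemma 8.2 can be viewed as a
population (asymptotic) analog of Lemma 8.1. Clearly,
$P(L_* D) R_1 = P(Q_V R) R_1$ and
$P(L_* D) V W_1 = 0 = P(Q_V R) V W_1$. These equalities and Lemma
8.2 imply that for any conformable data matrix $B_i$ of $T$ rows (and
$B$ of $NT$ rows),

$$\begin{aligned}
\operatorname*{plim}_{N\to\infty} N^{-1} B' P(L_* D) X_1
&= E[B_i'(L_T \otimes d_i)][E(L_T' L_T \otimes d_i' d_i)]^{-1} E[(L_T \otimes d_i)' X_{1,i}] \\
&= E(B_i' Q_T R_i)[E(R_i' Q_T R_i)]^{-1} E(R_i' Q_T X_{1,i}) \\
&= \operatorname*{plim}_{N\to\infty} N^{-1} B' P(Q_V R) X_1;
\end{aligned}$$

$$\begin{aligned}
\operatorname*{plim}_{N\to\infty} N^{-1/2} B' P(L_* D) X_1
&= \operatorname*{plim}_{N\to\infty} N^{-1/2} \sum_{i=1}^{N} B_i'(L_T \otimes d_i)[E(L_T' L_T \otimes d_i' d_i)]^{-1} E[(L_T \otimes d_i)' X_{1,i}] \\
&= \operatorname*{plim}_{N\to\infty} N^{-1/2} \sum_{i=1}^{N} B_i' Q_T R_i [E(R_i' Q_T R_i)]^{-1} E(R_i' Q_T X_{1,i}) \\
&= \operatorname*{plim}_{N\to\infty} N^{-1/2} B' P(Q_V R) X_1.
\end{aligned}$$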

Lemma 8.2 leads to an asymptotic analog of Theorem 8.2. 


THEOREM 8.3 


Suppose that $\mathrm{cov}(u_{1,i}) = \Sigma_{11}$ is of the random
effects form (8-3). Then, the 3SLS estimator ($\hat\beta_{1,A}$) using
the instruments $Z_{A,i}$ is asymptotically identical to the 3SLS
estimator ($\hat\beta_{1,B}$) using all of the instruments $Z_{B,i}$.
If, in addition, the instruments $Z_{B,i}$ satisfy the NCH condition
(8-6), $\hat\beta_{1,A}$ is efficient among the class of GMM
estimators based on the moment condition $E(Z_{B,i}' u_{1,i}) = 0$.


PROOF OF THE THEOREM


Let $\Sigma_{11} = \sigma_{\alpha,11}^2 e_T e_T' + \sigma_{\varepsilon,11}^2 I_T$.
Without loss of generality, we set $\sigma_{\varepsilon,11}^2 = 1$.
Then, similarly to the proof of Theorem 8.2, we can show

$$\hat\beta_{1,B} = [X_1'\{\theta_{11}^{-2} P(L_* D) + P(VS)\}X_1]^{-1} X_1'\{\theta_{11}^{-2} P(L_* D) + P(VS)\}y_1;$$

$$\hat\beta_{1,A} = [X_1'\{\theta_{11}^{-2} P(Q_V R) + P(VS)\}X_1]^{-1} X_1'\{\theta_{11}^{-2} P(Q_V R) + P(VS)\}y_1,$$

where $\theta_{11}^2 = \sigma_{\varepsilon,11}^2 / (\sigma_{\varepsilon,11}^2 + T \sigma_{\alpha,11}^2)$.
But, Lemma 8.2 implies that
$\operatorname*{plim}_{N\to\infty} \sqrt{N}(\hat\beta_{1,A} - \hat\beta_{1,B}) = 0$.


8.4.2 System of Equations Estimation


We now consider the joint estimation of all the equations in the
system (8-20). Following our convention for matrix notation,
structural equation $j$ can be written for all $NT$ observations as
$y_j = X_j \beta_j + u_j$. If we stack these equations into the
seemingly unrelated regressions (SUR) form, we have

$$y_* = X_* \beta + u_*, \qquad (8-23)$$

where $y_* = (y_1', \ldots, y_J')'$,
$X_* = \mathrm{diag}(X_1, \ldots, X_J)$,
$u_* = (u_1', \ldots, u_J')'$ and
$\beta = (\beta_1', \ldots, \beta_J')'$. We denote
$\Omega_* = \mathrm{cov}(u_*)$. Straightforward algebra shows that
$\Omega_* = [I_N \otimes \Sigma_{jh}]_{NTJ \times NTJ}$; that
is, if we partition $\Omega_*$ evenly into $J \times J$ blocks, the
$(j, h)$-th block is $I_N \otimes \Sigma_{jh}$.

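Following Cornwell, Schmidt and Wyhowski [1992], we consider
the system GMM estimator based on the moment conditions (8-21).
Define $Z_{A,i}^S = I_J \otimes Z_{A,i}$ and
$u_i^S = (u_{1,i}', \ldots, u_{J,i}')'$, so that we can write
all of the moment conditions (8-21) compactly as
$E(Z_{A,i}^{S\prime} u_i^S) = 0$. Use of these moment conditions leads
to the system GMM estimator

$$\hat\beta_{SGMM} = [X_*' Z_A^S (V_A^S)^{-1} Z_A^{S\prime} X_*]^{-1} X_*' Z_A^S (V_A^S)^{-1} Z_A^{S\prime} y_*,$$

where $Z_A^S = I_J \otimes Z_A$ and

$$V_A^S = N^{-1} \sum_{i=1}^{N} Z_{A,i}^{S\prime} u_i^S u_i^{S\prime} Z_{A,i}^S = N^{-1} \Big[ \sum_{i=1}^{N} Z_{A,i}' u_{j,i} u_{h,i}' Z_{A,i} \Big].$$

The 3SLS version of this system estimator can be obtained if we
replace $V_A^S$ by

$$N^{-1} Z_A^{S\prime} \Omega_* Z_A^S = N^{-1} \Big[ \sum_{i=1}^{N} Z_{A,i}' \Sigma_{jh} Z_{A,i} \Big];$$

that is,

$$\hat\beta_{S3SLS} = [X_*' Z_A^S (Z_A^{S\prime} \Omega_* Z_A^S)^{-1} Z_A^{S\prime} X_*]^{-1} X_*' Z_A^S (Z_A^{S\prime} \Omega_* Z_A^S)^{-1} Z_A^{S\prime} y_*.$$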


In fact, this system 3SLS estimator is a generalization of the 3SLS 
estimator proposed by Cornwell, Schmidt and Wyhowski [1992]. 
To see this, we make the following assumptions, as in Cornwell, 
Schmidt and Wyhowski [1992, Assumption 1, p. 157]: 
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AssumpTION 8.1 


(i) The individual effects for person i, a; = (Q1,,...,@y;)! 
are i.i.d.(0,5q). (ii) The random errors for person i at 
time t, (€1,:t,---,€sit)’, are i.i.d. (0, 5e). (iii) All elements 
of a are uncorrelated with all of elements of e. (iv) Xa and 
be are nonsingular. 


Under Assumption 8.1, we can show 
Q. = E; @Inr + Xa 8 (TPy) = E1 8 Qv + Le @ Py, 


where ©, = b, and Xs = ©,+T,. Using this result and the facts 
that Ze = I; 8 Za = (I; ® QvR, Ij Q VS) and R'QvVS = 0, we 
can easily show that 


Z3(Z2 0,28) ZP = Sy @ QvR+ Zz" OVS. 


Substituting this result into the system 3SLS estimator $\hat\beta_{S3SLS}$
yields the 3SLS estimator of Cornwell, Schmidt and Wyhowski
[1992, p. 164, equation (34)]. Thus $\hat\beta_{S3SLS}$ simplifies to the Cornwell,
Schmidt and Wyhowski 3SLS estimator when the errors have
the random effects covariance structure implied by Assumption
8.1; otherwise, it is the appropriate generalization of the Cornwell,
Schmidt and Wyhowski estimator.
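The projection identity behind this simplification can be checked numerically. The following NumPy sketch is only an illustration: it assumes $V = I_N \otimes e_T$ (so that $T P_V = I_N \otimes e_T e_T'$), and the instrument matrices `R` and `S`, the sizes and the covariance matrices are random stand-ins that do not come from the text.

```python
import numpy as np

def proj(A):
    """Orthogonal projection onto the column space of A."""
    return A @ np.linalg.solve(A.T @ A, A.T)

rng = np.random.default_rng(0)
N, T, J = 6, 3, 2                           # illustrative sizes (assumed)
V = np.kron(np.eye(N), np.ones((T, 1)))     # assumed: V = I_N (x) e_T
P_V = proj(V)
Q_V = np.eye(N * T) - P_V

R = rng.normal(size=(N * T, 2))             # time-varying instruments (random stand-ins)
S = rng.normal(size=(N, 2))                 # time-invariant instruments (random stand-ins)

def random_spd(k):
    """Random symmetric positive definite k x k matrix."""
    A = rng.normal(size=(k, k))
    return A @ A.T + k * np.eye(k)

Sigma_eps, Sigma_alpha = random_spd(J), random_spd(J)
Sigma_1 = Sigma_eps
Sigma_2 = Sigma_eps + T * Sigma_alpha

# Random effects covariance of the stacked system errors.
Omega = np.kron(Sigma_1, Q_V) + np.kron(Sigma_2, P_V)
Z_star = np.hstack([np.kron(np.eye(J), Q_V @ R), np.kron(np.eye(J), V @ S)])

lhs = Z_star @ np.linalg.solve(Z_star.T @ Omega @ Z_star, Z_star.T)
rhs = (np.kron(np.linalg.inv(Sigma_1), proj(Q_V @ R))
       + np.kron(np.linalg.inv(Sigma_2), proj(V @ S)))
max_abs_diff = np.max(np.abs(lhs - rhs))
print(max_abs_diff)
```

The two sides agree up to rounding error because $Q_V R$ and $V S$ are orthogonal, which makes $Z_A^{*\prime} \Omega_* Z_A^*$ block diagonal.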


8.5 Dynamic Panel Data Models 


In this section we consider a regression model for dynamic panel 
data. The model of interest is given by: 


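$$y_i = y_{i,-1}\delta + R_i\xi + (e_T \otimes w_i)\gamma + u_i = X_i\beta + u_i; \quad u_i = e_T \otimes \alpha_i + \varepsilon_i,$$ (8-24)

where $y_{i,-1} = (y_{i0}, \ldots, y_{i,T-1})'$, $y_{i0}$ is the initial observed value of
$y$ (for individual $i$), and other variables are defined exactly as in
(8-13).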


The basic problem faced in the estimation of this model is
that the traditional within estimator is inconsistent, because the
within transformation induces a correlation of order $1/T$ between
the lagged dependent variable and the random error (see Hsiao
[1986]). A popular solution to this problem is to first difference
the equation to remove the effects, and then estimate by GMM,
using as instruments values of the dependent variable lagged two
or more periods as well as other exogenous regressors. Legitimacy
of the lagged dependent variables as instruments requires some
covariance restrictions on $\varepsilon_i$, $\alpha_i$ and $y_{i0}$. However, these covariance
restrictions imply more moment conditions than are imposed by the
GMM estimator based on first differences. In this section, we study
the moment conditions implied by a standard set of covariance
restrictions and other alternative assumptions. We also examine
how these moment conditions can be efficiently imposed in GMM.


A good survey, which emphasizes somewhat different aspects 
of the estimation problems, is given in Mátyás and Sevestre [1995, 
Ch. 7]. 
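The inconsistency of the within estimator, and the first-difference remedy described above, can be illustrated with a small simulation. This is a minimal sketch under assumed values ($N = 5000$, $T = 5$, $\delta = 0.5$, unit variances, a stationary start for $y_{i0}$). The instrumental variables estimator shown uses only $y_{i,t-2}$ as an instrument for $\Delta y_{i,t-1}$, so it is a simple just-identified estimator rather than the full GMM estimator discussed in this section.

```python
import numpy as np

rng = np.random.default_rng(12345)
N, T, delta = 5000, 5, 0.5              # illustrative values (assumed)

alpha = rng.normal(size=N)              # individual effects
y = np.empty((N, T + 1))                # columns are t = 0, 1, ..., T
# Stationary start, chosen only for convenience in this sketch.
y[:, 0] = alpha / (1 - delta) + rng.normal(size=N) / np.sqrt(1 - delta**2)
for t in range(1, T + 1):
    y[:, t] = delta * y[:, t - 1] + alpha + rng.normal(size=N)

# Within estimator: demean y_it and y_{i,t-1} over t = 1..T, then OLS.
y_cur, y_lag = y[:, 1:], y[:, :-1]
yc = y_cur - y_cur.mean(axis=1, keepdims=True)
yl = y_lag - y_lag.mean(axis=1, keepdims=True)
delta_within = (yl * yc).sum() / (yl * yl).sum()

# First-difference IV: instrument dy_{i,t-1} with the level y_{i,t-2}, t = 2..T.
dy = np.diff(y, axis=1)                 # dy[:, k] = y_{i,k+1} - y_{i,k}
dy_cur, dy_lag = dy[:, 1:], dy[:, :-1]  # dy_it and dy_{i,t-1} for t = 2..T
z = y[:, :T - 1]                        # y_{i,t-2} for t = 2..T
delta_iv = (z * dy_cur).sum() / (z * dy_lag).sum()

print(f"true delta = {delta}, within = {delta_within:.3f}, IV = {delta_iv:.3f}")
```

With $T$ this small the within estimate should fall well below the true value even though $N$ is large, while the instrumental variables estimate should stay close to it.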


8.5.1 Moment Conditions Under Standard
Assumptions


In this subsection we count and express the moment conditions
implied by a standard set of assumptions about $\alpha_i$, $\varepsilon_i$ and $y_{i0}$. For
simplicity, and without loss of generality, we do so in the context of
the simple dynamic model whose only explanatory variable is the
lagged dependent variable:

$$y_i = \delta y_{i,-1} + u_i; \quad u_i = e_T \otimes \alpha_i + \varepsilon_i.$$ (8-25)


Consistently with the previous sections, we assume that $\alpha_i$ and $\varepsilon_{it}$
have mean zero for all $i$ and $t$. (Nonzero mean of $\alpha_i$ can be handled
with an intercept which can be regarded as an exogenous regressor.)
We also assume that $E(y_{i0}) = 0$. We make this assumption in order
to focus our discussion on the (second-order) moment conditions
implied by covariance restrictions on $\varepsilon_i$, $\alpha_i$ and $y_{i0}$. If $E(y_{i0}) \neq 0$,
the first-order moment conditions $E(u_{it}) = 0$ ($t = 1, \ldots, T$) are
relevant in GMM. Imposing these first-order moment conditions
could improve efficiency of GMM estimators, as Crépon, Kramarz
and Trognon [1995] suggest. In contrast, if $E(y_{i0}) = 0$ (and $E(\alpha_i) =
0$), the first-order moment conditions become uninformative for the
unknown parameter $\delta$ because they cannot identify a unique $\delta$. This
is so because for any value of $\delta$,

$$E(u_{it}) = E(y_{it} - \delta y_{i,t-1}) = E(y_{it}) - \delta E(y_{i,t-1}) = 0.$$

The following assumptions are most commonly adopted in the dynamic
panel data literature.


ASSUMPTION 8.2

(i) For all $i$, $\varepsilon_{it}$ is uncorrelated with $y_{i0}$ for all $t$. (ii) For
all $i$, $\varepsilon_{it}$ is uncorrelated with $\alpha_i$ for all $t$. (iii) For all $i$, the
$\varepsilon_{it}$ are mutually uncorrelated.


Under Assumption 8.2, it is obvious that the following moment
conditions hold:

$$E(y_{is}\Delta u_{it}) = 0, \quad t = 2, \ldots, T, \quad s = 0, \ldots, t-2,$$ (8-26)

where $\Delta u_{it} = u_{it} - u_{i,t-1} = \varepsilon_{it} - \varepsilon_{i,t-1}$. There are $T(T-1)/2$
such conditions. These are the moment conditions that are widely
used in the panel data literature (e.g., Anderson and Hsiao [1981],
Holtz-Eakin [1988], Holtz-Eakin, Newey and Rosen [1988], Arellano
and Bond [1991]). However, as Ahn and Schmidt [1995] find,
Assumption 8.2 implies additional moment conditions beyond those
in (8-26). In particular, the following $T-2$ moment conditions also
hold:

$$E(u_{iT}\Delta u_{it}) = 0, \quad t = 2, \ldots, T-1,$$ (8-27)

which are nonlinear in terms of $\delta$.


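The conditions (8-26) and (8-27) are a set of $T(T-1)/2 + (T-2)$
moment conditions that follow directly from the assumptions
that the $\varepsilon_{it}$ are mutually uncorrelated and uncorrelated with $\alpha_i$
and $y_{i0}$. Furthermore, they represent all of the moment conditions
implied by these assumptions. A formal proof of the number of
restrictions implied by Assumption 8.2 can be given as follows.
Define $\sigma_{tt} = \mathrm{var}(\varepsilon_{it})$, $\sigma_{\alpha\alpha} = \mathrm{var}(\alpha_i)$, $\sigma_{00} = \mathrm{var}(y_{i0})$ and
$\sigma_{0\alpha} = \mathrm{cov}(y_{i0}, \alpha_i)$. Then,
Assumption 8.2 imposes the following covariance restrictions on
the initial value $y_{i0}$ and the composite errors $u_{i1}, \ldots, u_{iT}$:

$$\Lambda = \mathrm{cov}\begin{pmatrix} u_{i1} \\ u_{i2} \\ \vdots \\ u_{iT} \\ y_{i0} \end{pmatrix}
= \begin{pmatrix}
\lambda_{11} & \lambda_{12} & \cdots & \lambda_{1T} & \lambda_{10} \\
\lambda_{21} & \lambda_{22} & \cdots & \lambda_{2T} & \lambda_{20} \\
\vdots & \vdots & & \vdots & \vdots \\
\lambda_{T1} & \lambda_{T2} & \cdots & \lambda_{TT} & \lambda_{T0} \\
\lambda_{01} & \lambda_{02} & \cdots & \lambda_{0T} & \lambda_{00}
\end{pmatrix}
= \begin{pmatrix}
(\sigma_{\alpha\alpha} + \sigma_{11}) & \sigma_{\alpha\alpha} & \cdots & \sigma_{\alpha\alpha} & \sigma_{0\alpha} \\
\sigma_{\alpha\alpha} & (\sigma_{\alpha\alpha} + \sigma_{22}) & \cdots & \sigma_{\alpha\alpha} & \sigma_{0\alpha} \\
\vdots & \vdots & & \vdots & \vdots \\
\sigma_{\alpha\alpha} & \sigma_{\alpha\alpha} & \cdots & (\sigma_{\alpha\alpha} + \sigma_{TT}) & \sigma_{0\alpha} \\
\sigma_{0\alpha} & \sigma_{0\alpha} & \cdots & \sigma_{0\alpha} & \sigma_{00}
\end{pmatrix}.$$

There are $T-1$ restrictions, that $\lambda_{0t}$ is the same for $t = 1, \ldots, T$;
and $T(T-1)/2 - 1$ restrictions, that $\lambda_{ts}$ is the same for $t, s =
1, \ldots, T$, $t \neq s$. Adding the number of restrictions, we get $T(T-1)/2 + (T-2)$.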
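The counting argument above can be mirrored by enumerating the index sets of (8-26) and (8-27) and comparing their sizes with the number of covariance restrictions. The function names below are hypothetical helpers introduced only for this illustration.

```python
def moments_8_26(T):
    """Index pairs (t, s) of E(y_is * du_it) = 0: t = 2..T, s = 0..t-2."""
    return [(t, s) for t in range(2, T + 1) for s in range(0, t - 1)]

def moments_8_27(T):
    """Indices t of E(u_iT * du_it) = 0: t = 2..T-1."""
    return list(range(2, T))

def covariance_restrictions(T):
    """Equal lambda_0t (T-1 restrictions) plus equal off-diagonal lambda_ts (T(T-1)/2 - 1)."""
    return (T - 1) + (T * (T - 1) // 2 - 1)

for T in (3, 4, 5, 10):
    n_linear, n_nonlinear = len(moments_8_26(T)), len(moments_8_27(T))
    assert n_linear == T * (T - 1) // 2
    assert n_linear + n_nonlinear == covariance_restrictions(T)
    print(T, n_linear, n_nonlinear, n_linear + n_nonlinear)
```

For $T = 4$, for instance, this gives six linear conditions and two nonlinear ones, matching the eight restrictions on $\Lambda$.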


Since the moment conditions (8-27) are nonlinear, GMM imposing
these conditions requires an iterative procedure. Thus, an
important practical question is whether this computational burden
is worthwhile. Ahn and Schmidt [1995] provide a partial answer.
They compare the asymptotic variances of the GMM estimator
based on (8-26) only and the GMM estimator based on both
(8-26) and (8-27). Their computational results show that use of the
extra moment conditions (8-27) can result in a large efficiency gain,
especially when $\delta$ is close to one or the variance $\sigma_{\alpha\alpha}$ is large.


8.5.2 Some Alternative Assumptions 


We now briefly consider some alternative sets of assumptions. The
first case we consider is the one in which Assumption 8.2 is augmented
by the additional assumption that the $\varepsilon_{it}$ are homoskedastic.
That is, suppose that we add the assumption:

ASSUMPTION 8.3

For all $i$, $\mathrm{var}(\varepsilon_{it})$ is the same for all $t$.


This assumption, when added to Assumption 8.2, generates the
additional $(T-1)$ moment conditions that

$$E(u_{it}^2) \ \text{is the same for} \ t = 1, \ldots, T.$$ (8-28)

(In terms of $\Lambda$ above, $\lambda_{tt}$ is the same for $t = 1, \ldots, T$.) Therefore
the total number of moment conditions becomes $T(T-1)/2 + (2T-3)$.
These moment conditions can be expressed as (8-26)-(8-28).
Alternatively, if we wish to maximize the number of linear moment
conditions, these moment conditions can be expressed as (8-26)
plus the additional conditions

$$E(y_{it}\Delta u_{i,t+1} - y_{i,t+1}\Delta u_{i,t+2}) = 0, \quad t = 1, \ldots, T-2,$$ (8-29)
$$E(\bar u_i \Delta u_{i,t+1}) = 0, \quad t = 1, \ldots, T-1,$$ (8-30)

where $\bar u_i = T^{-1}\sum_{t=1}^{T} u_{it}$. Comparing this to the set of moment
conditions without homoskedasticity (8-26)-(8-27), we see that
homoskedasticity adds $T-1$ moment conditions and it allows $T-2$
previously nonlinear moment conditions to be expressed linearly.


Ahn and Schmidt [1995] quantify the asymptotic efficiency gain
from imposing the extra moment conditions (8-27) and (8-28) (or
equivalently, (8-29) and (8-30)) in addition to (8-26). Their results
show that most of the efficiency gains come from the moment conditions
(8-27). That is, we do not gain much efficiency from the assumption
of homoskedasticity (Assumption 8.3).
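A short derivation, offered as a sketch that uses only Assumptions 8.2 and 8.3 and the notation $\sigma_{tt} = \mathrm{var}(\varepsilon_{it})$, $\bar u_i = T^{-1}\sum_{t} u_{it}$, may clarify why homoskedasticity can be written as the conditions (8-29) and (8-30). Since $y_{it} = \delta y_{i,t-1} + \alpha_i + \varepsilon_{it}$ and $\varepsilon_{it}$ is uncorrelated with $y_{i0}$, $\alpha_i$ and the errors of other periods,

$$E(y_{it}\Delta u_{i,t+1}) = E[y_{it}(\varepsilon_{i,t+1} - \varepsilon_{it})] = -E(y_{it}\varepsilon_{it}) = -\sigma_{tt},$$

so that

$$E(y_{it}\Delta u_{i,t+1} - y_{i,t+1}\Delta u_{i,t+2}) = \sigma_{t+1,t+1} - \sigma_{tt}.$$

Similarly,

$$E(\bar u_i \Delta u_{i,t+1}) = T^{-1}\sum_{s=1}^{T} E[(\alpha_i + \varepsilon_{is})(\varepsilon_{i,t+1} - \varepsilon_{it})] = T^{-1}(\sigma_{t+1,t+1} - \sigma_{tt}).$$

Both expressions vanish when $\sigma_{tt}$ does not depend on $t$.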


Another possible assumption we may impose on the model
(8-25) is the stationarity assumption of Arellano and Bover [1995]:

ASSUMPTION 8.4

$\mathrm{cov}(\alpha_i, y_{it})$ is the same for all $t$.

This is an assumption of the type made by BMS (see (8-17)); it
requires equal covariance between the effects and the variables with
which they are correlated. Ahn and Schmidt [1995] show that, given
Assumption 8.2, Assumption 8.4 corresponds to the restriction that

$$\sigma_{0\alpha} = \sigma_{\alpha\alpha}/(1 - \delta)$$ (8-31)

and implies one additional moment restriction. Furthermore, they
show that it also allows the entire set of available moment conditions
to be written linearly; that is, (8-26) plus

$$E(u_{iT}\Delta y_{it}) = 0, \quad t = 1, \ldots, T-1;$$
$$E(u_{it}y_{it} - u_{i,t-1}y_{i,t-1}) = 0, \quad t = 2, \ldots, T.$$

This is a set of $T(T-1)/2 + (2T-2)$ moment conditions, all
of which are linear in $\delta$. Blundell and Bond [1997] show that the
GMM estimator exploiting all of these linear moment conditions has
much better asymptotic and finite sample properties than the GMM
estimator based on (8-26) only. Thus the stationarity Assumption
8.4 may be quite useful.
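The restriction (8-31) can be recovered in one line (a sketch, using Assumption 8.2 only, with $\sigma_{0\alpha} = \mathrm{cov}(\alpha_i, y_{i0})$ and $\sigma_{\alpha\alpha} = \mathrm{var}(\alpha_i)$). Because $\varepsilon_{it}$ is uncorrelated with $\alpha_i$, the model (8-25) gives

$$\mathrm{cov}(\alpha_i, y_{it}) = \delta\,\mathrm{cov}(\alpha_i, y_{i,t-1}) + \sigma_{\alpha\alpha}.$$

If this covariance is to equal $\sigma_{0\alpha}$ in every period, then $\sigma_{0\alpha} = \delta\sigma_{0\alpha} + \sigma_{\alpha\alpha}$, that is, $\sigma_{0\alpha} = \sigma_{\alpha\alpha}/(1 - \delta)$. Conversely, starting from this value at $t = 0$, the recursion reproduces it in every later period.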


Finally, we consider an alternative stationarity assumption,
which is examined by Ahn and Schmidt [1997]:

ASSUMPTION 8.5

In addition to Assumptions 8.2 and 8.3, the series
$y_{i0}, \ldots, y_{iT}$ is covariance stationary.

To see the connection between the two stationarity assumptions,
Assumptions 8.4 and 8.5, we use the solution

$$y_{it} = \delta^t y_{i0} + \alpha_i(1 - \delta^t)/(1 - \delta) + \sum_{j=0}^{t-1} \delta^j \varepsilon_{i,t-j}$$

to calculate

$$\mathrm{var}(y_{it}) = \sigma_{00}\delta^{2t} + 2\sigma_{0\alpha}\delta^t(1 - \delta^t)/(1 - \delta)
+ \sigma_{\alpha\alpha}[(1 - \delta^t)/(1 - \delta)]^2 + \sigma_{\varepsilon\varepsilon}(1 - \delta^{2t})/(1 - \delta^2),$$

where the calculation assumes Assumptions 8.2 and 8.3. Assumption
8.5 implies that $\mathrm{var}(y_{it}) = \sigma_{00}$ for all $t$, which occurs if and
only if $\sigma_{\alpha\alpha} = (1 - \delta)\sigma_{0\alpha}$ and also

$$\sigma_{00} = \sigma_{\alpha\alpha}/(1 - \delta)^2 + \sigma_{\varepsilon\varepsilon}/(1 - \delta^2).$$ (8-32)

Thus Assumption 8.5 implies $\sigma_{\alpha\alpha} = (1 - \delta)\sigma_{0\alpha}$, which in turn implies
Assumption 8.4. However, it also implies the restriction (8-32) on
the variance of the initial observation $y_{i0}$. Imposing (8-32) as well
as Assumptions 8.2-8.4 yields one additional, nonlinear moment
condition:

$$E[y_{i0}^2 + y_{i1}\Delta u_{i2}/(1 - \delta^2) - u_{i2}u_{i1}/(1 - \delta)^2] = 0.$$
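The role of the two restrictions can be checked numerically. The sketch below codes the variance of $y_{it}$ implied by the solved-out model under Assumptions 8.2 and 8.3 and confirms that it is constant in $t$ once (8-31) and (8-32) hold. The parameter values and the function name `var_y` are assumptions made only for this illustration.

```python
def var_y(t, delta, s00, s0a, saa, see):
    """var(y_it) implied by the solved-out model under Assumptions 8.2 and 8.3."""
    d_t = delta ** t
    return (s00 * d_t ** 2
            + 2 * s0a * d_t * (1 - d_t) / (1 - delta)
            + saa * (1 - d_t) ** 2 / (1 - delta) ** 2
            + see * (1 - d_t ** 2) / (1 - delta ** 2))

delta, saa, see = 0.6, 1.5, 2.0         # illustrative values (assumed)
s0a = saa / (1 - delta)                 # restriction (8-31)
s00 = saa / (1 - delta) ** 2 + see / (1 - delta ** 2)   # restriction (8-32)

# Under both restrictions the variance is the same in every period.
stationary = [var_y(t, delta, s00, s0a, saa, see) for t in range(0, 6)]
# Violating (8-32) alone makes the variance depend on t.
perturbed = [var_y(t, delta, s00 + 1.0, s0a, saa, see) for t in range(0, 6)]

print(stationary)
print(perturbed)
```

Perturbing $\sigma_{00}$ away from (8-32) adds a term $\delta^{2t}$ times the perturbation, which dies out as $t$ grows but breaks covariance stationarity in the early periods.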




An interesting question that we do not address here is how 
many moment conditions we would have if the assumptions dis- 
cussed above are relaxed. Ahn and Schmidt [1997] give a partial 
answer by counting the moment conditions implied by many pos- 
sible combinations of the above assumptions. See that paper for 
more detail. 


8.5.3 Estimation 


In this subsection we discuss some theoretical details concerning 
GMM estimation of the dynamic model. We also discuss the rela- 
tionship between GMM based on the linear moment conditions and 
3SLS estimation. Our discussion will proceed under Assumptions 
8.2-8.3, but can easily be modified to accommodate the other cases. 


8.5.3.1 Notation and General Results 


We now return to the model (8-24) which includes exogenous regressors
$R_i$ and $w_i$. Exogeneity assumptions on $R_i$ and $w_i$ generate
linear moment conditions of the form

$$E(C_i'u_i) = 0,$$

where $C_i = Z_{A,i}$ or $Z_{B,i}$, as defined in Section 8.3. In addition, the
moment conditions given by (8-26), (8-29) and (8-30) above are
valid. The moment conditions in (8-26) above are linear in $\beta$ and
can be written as $E(A_i'u_i) = 0$, where $A_i$ is the $T \times T(T-1)/2$
matrix

$$A_i = \begin{pmatrix}
-y_{i0} & 0 & \cdots & 0 \\
y_{i0} & -(y_{i0}, y_{i1}) & \cdots & 0 \\
0 & (y_{i0}, y_{i1}) & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & -(y_{i0}, y_{i1}, \ldots, y_{i,T-2}) \\
0 & 0 & \cdots & (y_{i0}, y_{i1}, \ldots, y_{i,T-2})
\end{pmatrix}.$$

Similarly, the moment conditions in (8-29) above are also linear in
$\beta$ and can be written as $E(B_i'u_i) = 0$, where $B_i$ is the $T \times (T-2)$
matrix defined by

$$B_i = \begin{pmatrix}
-y_{i1} & 0 & \cdots & 0 \\
(y_{i1} + y_{i2}) & -y_{i2} & \cdots & 0 \\
-y_{i2} & (y_{i2} + y_{i3}) & \cdots & 0 \\
\vdots & \vdots & & \vdots \\
0 & 0 & \cdots & -y_{i,T-2} \\
0 & 0 & \cdots & (y_{i,T-2} + y_{i,T-1}) \\
0 & 0 & \cdots & -y_{i,T-1}
\end{pmatrix}.$$

However, the moment conditions in (8-30) above are quadratic in
$\beta$.

We will discuss GMM estimation based on all of the available
moment conditions and GMM based on a subset (possibly all) of the
linear moment conditions. Let $H_i = (C_i, A_i, B_i)$, which represents
all of the available linear instruments. The corresponding linear
moment conditions are $E[m_i(\beta)] = 0$, with

$$m_i(\beta) = H_i'u_i = m_{1i} + m_{2i}\beta, \quad m_{1i} = H_i'y_i, \quad m_{2i} = -H_i'X_i.$$

The remaining nonlinear moment conditions will be written as
$E[g_i(\beta)] = 0$. Since they are at most quadratic, we can write

$$g_i(\beta) = g_{1i} + g_{2i}\beta + (I_q \otimes \beta')g_{3i}\beta,$$

where $g_{1i}$, $g_{2i}$ and $g_{3i}$ are conformable matrices of functions of
data and $q$ is the number of moment conditions in $g_i$. An efficient
estimator of $\beta$ can be obtained by GMM based on all of the moment
conditions

$$E[f_i(\beta)] = E[m_i(\beta)', g_i(\beta)']' = 0.$$

Define $f_N = N^{-1}\sum_{i=1}^{N} f_i(\beta)$, with $m_N$, $m_{1N}$, $m_{2N}$, $g_N$, $g_{1N}$, $g_{2N}$ and
$g_{3N}$ defined similarly; and define

$$F_N = \partial f_N/\partial\beta' = [\partial m_N'/\partial\beta, \partial g_N'/\partial\beta]' = [M_N', G_N']',$$

where $G_N(\beta) = g_{2N} + 2(I_q \otimes \beta')g_{3N}$ and $M_N = m_{2N}$. Let
$F = \mathrm{plim}\, F_N$, with $M$ and $G$ defined similarly. Define the optimal
weighting matrix

$$V = \begin{pmatrix} V_{mm} & V_{mg} \\ V_{gm} & V_{gg} \end{pmatrix} = E(f_i f_i').$$

Let $V_N$ be a consistent estimate of $V$ of the form

$$V_N = N^{-1}\sum_{i=1}^{N} f_i(\tilde\beta) f_i(\tilde\beta)',$$

where $\tilde\beta$ is an initial consistent estimate of $\beta$ (perhaps based on
the linear moment conditions $m_i$, as discussed below); partition it
similarly to $V$.

In this notation, the efficient GMM estimator $\hat\beta_{GMM}$ minimizes

$$N f_N(\beta)' V_N^{-1} f_N(\beta).$$

Using standard results, the asymptotic covariance matrix of
$N^{1/2}(\hat\beta_{GMM} - \beta)$ is $[F'V^{-1}F]^{-1}$.


8.5.3.2 Linear Moment Conditions and Instrumental 
Variables 


Some interesting questions arise when we consider GMM based
on the linear moment conditions $m_i(\beta)$ only. The optimal GMM
estimator based on these conditions is

$$\hat\beta_m = -[m_{2N}'(V_{N,mm})^{-1}m_{2N}]^{-1} m_{2N}'(V_{N,mm})^{-1} m_{1N}
= [X'H(V_{N,mm})^{-1}H'X]^{-1} X'H(V_{N,mm})^{-1}H'y.$$

This GMM estimator can be compared to the 3SLS estimator which
is obtained by replacing $V_{N,mm}$ by $N^{-1}H'\Omega H = N^{-1}\sum_i H_i'\Sigma H_i$.⁴
As discussed in Section 8.1, they are asymptotically equivalent in
the case that $V_{mm} = E(H_i'u_iu_i'H_i) = E(H_i'\Sigma H_i)$. For the case that
$H_i$ consists only of columns of $C_i$, so that only the moment conditions
$E(C_i'u_i) = 0$ based on exogeneity of $R_i$ and $w_i$ are imposed,
this equivalence may hold. Arellano and Bond [1991] considered the
moment conditions (8-26), so that $H_i$ also contains $A_i$, and noted
that asymptotic equivalence between the 3SLS and GMM estimates
fails if we relax the homoskedasticity assumption, Assumption 8.3,
even though the moment conditions (8-26) are still valid under
only Assumption 8.2. In fact, even the full set of Assumptions
8.2-8.3 is not sufficient to imply the asymptotic equivalence of the
3SLS and GMM estimates when the moment conditions (8-27) are
used. Assumptions 8.2-8.3 deal only with second moments, whereas
asymptotic equivalence of 3SLS and GMM involves restrictions on
fourth moments (e.g., $\mathrm{cov}(y_{i0}^2, \varepsilon_{it}^2) = 0$). Ahn [1990] proved the
asymptotic equivalence of the 3SLS and GMM estimators based
on the moment conditions (8-26) for the case that Assumption
8.3 is maintained and Assumption 8.2 is strengthened by replacing
uncorrelatedness with independence. Wooldridge [1996] provides
a more general treatment of cases in which 3SLS and GMM are
asymptotically equivalent. In the present case, his results indicate
that asymptotic equivalence would hold if we rewrite Assumptions
8.2-8.3 in terms of conditional expectations instead of uncorrelatedness;
that is, if we assume

$$E(\varepsilon_{it} \mid y_{i0}, \alpha_i, \varepsilon_{i1}, \ldots, \varepsilon_{i,t-1}) = 0,$$
$$E(\varepsilon_{it}^2 \mid y_{i0}, \alpha_i, \varepsilon_{i1}, \ldots, \varepsilon_{i,t-1}) = \sigma_{\varepsilon\varepsilon}.$$

4 Note that Assumptions 8.2 and 8.3 imply the random effects structure
(8-3).


A more novel observation is that the asymptotic equivalence of
3SLS and GMM fails whenever we use the additional linear moment
conditions (8-29). This is so even if Assumptions 8.2-8.3
are strengthened by replacing uncorrelatedness with independence.
When uncorrelatedness in Assumptions 8.2-8.3 is replaced by independence,
Ahn [1990, Chapter 3, Appendix 3] shows that, while

$$E(A_i'u_iu_i'A_i) = \sigma_{\varepsilon\varepsilon}E(A_i'A_i) = E(A_i'\Sigma A_i),$$
$$E(A_i'u_iu_i'B_i) = \sigma_{\varepsilon\varepsilon}E(A_i'B_i) = E(A_i'\Sigma B_i),$$
$$E(B_i'u_iu_i'B_i) = \sigma_{\varepsilon\varepsilon}E(B_i'B_i) + (\kappa + \sigma_{\varepsilon\varepsilon}^2)L_{T-1}'L_{T-1}
= E(B_i'\Sigma B_i) + (\kappa + \sigma_{\varepsilon\varepsilon}^2)L_{T-1}'L_{T-1},$$

where $\kappa = E(\varepsilon_{it}^4) - 3\sigma_{\varepsilon\varepsilon}^2$, and $L_{T-1}$ is the $(T-1) \times (T-2)$ differencing
matrix defined similarly to $L_T$ in Section 8.1. Under normality
$\kappa = 0$ but the term $\sigma_{\varepsilon\varepsilon}^2 L_{T-1}'L_{T-1}$ remains.
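The comparison between GMM and 3SLS comes down to the choice of weighting matrix in the same linear formula. The NumPy sketch below makes this concrete for a static random effects design with exogenous instruments. The data generating process, the array layout `(N, T, .)` and all function names are assumptions made for illustration, not the dynamic model of this section.

```python
import numpy as np

def linear_gmm(Hy, HX, W):
    """Return [X'H W H'X]^{-1} X'H W H'y given H'y (L,), H'X (L, k) and weight W (L, L)."""
    return np.linalg.solve(HX.T @ W @ HX, HX.T @ W @ Hy)

def cross_products(H, X, y):
    """Sums over individuals of H_i'y_i and H_i'X_i for stacked (N, T, .) arrays."""
    return np.einsum('ntl,nt->l', H, y), np.einsum('ntl,ntk->lk', H, X)

def two_step_gmm(H, X, y):
    """First step: 2SLS weight. Second step: V_N = N^{-1} sum_i H_i'u_i u_i'H_i."""
    N = H.shape[0]
    Hy, HX = cross_products(H, X, y)
    b1 = linear_gmm(Hy, HX, np.linalg.inv(np.einsum('ntl,ntm->lm', H, H)))
    u = y - X @ b1                          # first-step residuals, shape (N, T)
    Hu = np.einsum('ntl,nt->nl', H, u)      # row i holds H_i'u_i
    return b1, linear_gmm(Hy, HX, np.linalg.inv(Hu.T @ Hu / N))

def three_sls(H, X, y, Sigma):
    """Same formula with V_N replaced by N^{-1} sum_i H_i' Sigma H_i."""
    N = H.shape[0]
    Hy, HX = cross_products(H, X, y)
    V = np.einsum('ntl,ts,nsm->lm', H, Sigma, H) / N
    return linear_gmm(Hy, HX, np.linalg.inv(V))

rng = np.random.default_rng(0)
N, T = 400, 4                               # illustrative sizes (assumed)
beta_true = np.array([1.0, -2.0])
H = rng.normal(size=(N, T, 3))              # three instruments, two regressors
X = H[:, :, :2] + 0.5 * H[:, :, 2:] + 0.5 * rng.normal(size=(N, T, 2))
alpha = rng.normal(size=(N, 1))             # random effect
y = X @ beta_true + alpha + rng.normal(size=(N, T))
Sigma = np.eye(T) + np.ones((T, T))         # var(e_T * alpha_i + eps_i) in this design

b_2sls, b_gmm = two_step_gmm(H, X, y)
b_3sls = three_sls(H, X, y, Sigma)
print(b_2sls, b_gmm, b_3sls)
```

In a just-identified case (as many instruments as regressors) the weighting matrix drops out and all three estimates coincide. With overidentification they generally differ in finite samples, and whether they are asymptotically equivalent depends on the fourth moment conditions discussed above.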


8.5.3.3 Linearized GMM 


We now consider a linearized GMM estimator. Suppose that $\tilde\beta$ is
any consistent estimator of $\beta$; for example, $\hat\beta_m$. Following Newey
[1985, p. 238], the linearized GMM estimator is of the form

$$\hat\beta_{LGMM} = \tilde\beta - [F_N(\tilde\beta)'(V_N)^{-1}F_N(\tilde\beta)]^{-1}F_N(\tilde\beta)'(V_N)^{-1}f_N(\tilde\beta).$$

This estimator is consistent and has the same asymptotic distribution
as $\hat\beta_{GMM}$.

When the LGMM estimator is based on the initial estimator
$\hat\beta_m$, some further simplification is possible. Applying
the usual matrix inversion rule to $V_N$ and using the fact that
$m_{2N}'(V_{N,mm})^{-1}m_N(\hat\beta_m) = 0$, we can write the LGMM estimator as
follows:

$$\hat\beta_{LGMM} = \hat\beta_m - [\Gamma_N + B_N'(V_{N,bb})^{-1}B_N]^{-1}B_N'(V_{N,bb})^{-1}b_N,$$

where

$$\Gamma_N = m_{2N}'(V_{N,mm})^{-1}m_{2N}, \quad V_{N,bb} = V_{N,gg} - V_{N,gm}(V_{N,mm})^{-1}V_{N,mg},$$
$$b_N = g_N(\hat\beta_m) - V_{N,gm}(V_{N,mm})^{-1}m_N(\hat\beta_m), \quad \text{and}$$
$$B_N = G_N(\hat\beta_m) - V_{N,gm}(V_{N,mm})^{-1}m_{2N}.$$

For more detail, see Ahn and Schmidt [1997].
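The linearized estimator is a single Newton-type step from a consistent starting value. The sketch below implements that step for generic inputs. As a check, it uses purely linear moments $f(\beta) = m_1 + m_2\beta$, for which one step from any starting point lands exactly on the GMM minimizer. The function name and the random test inputs are hypothetical.

```python
import numpy as np

def lgmm_step(b0, f_bar, F_bar, V):
    """One linearized-GMM update: b0 - [F'V^{-1}F]^{-1} F'V^{-1} f, all evaluated at b0."""
    VinvF = np.linalg.solve(V, F_bar)
    Vinvf = np.linalg.solve(V, f_bar)
    return b0 - np.linalg.solve(F_bar.T @ VinvF, F_bar.T @ Vinvf)

rng = np.random.default_rng(1)
q, k = 5, 2                                 # number of moments and parameters (assumed)
m1 = rng.normal(size=q)
m2 = rng.normal(size=(q, k))
A = rng.normal(size=(q, q))
V = A @ A.T + q * np.eye(q)                 # symmetric positive definite weight

# Closed-form minimizer of (m1 + m2 b)' V^{-1} (m1 + m2 b).
b_exact = -np.linalg.solve(m2.T @ np.linalg.solve(V, m2),
                           m2.T @ np.linalg.solve(V, m1))

b0 = np.array([10.0, -7.0])                 # arbitrary starting value
b_one_step = lgmm_step(b0, m1 + m2 @ b0, m2, V)
print(b_exact, b_one_step)
```

With the quadratic conditions included, `F_bar` depends on the starting value, so the step is no longer exact, but it has the same asymptotic distribution as the fully iterated estimator, as stated above.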


8.6 Conclusion 


In this chapter we have considered the GMM estimation of linear 
panel data models. We have discussed standard models, including 
the fixed and random effects linear model, a dynamic model, and 
the simultaneous equation model. For these models the typical 
treatment in the literature is some sort of instrumental variables 
procedure; least squares and generalized least squares are included 
in the class of such instrumental variables procedures. 


It is well known that for linear models the GMM estimator 
often takes the form of an IV estimator if a no conditional het- 
eroskedasticity condition holds. Therefore we have focused on 
three related points. First, for each model we seek to identify 
the complete set of moment conditions (instruments) implied by 
the assumptions underlying the model. Next, one can observe 
that the usual exogeneity assumptions lead to many more moment
conditions than standard estimators use, and ask whether some 
or all of the moment conditions are redundant, in the sense that 
they are unnecessary to obtain an efficient estimator. Under the no 
conditional heteroskedasticity assumption, the efficiency of stan- 
dard estimators can often be established. This implies that the 
moment conditions which are not utilized by standard estimators 
are redundant. Finally, we ask whether anything intrinsic to the 
model makes the assumption of no conditional heteroskedasticity 
untenable. In some models, such as the dynamic model, this as- 
sumption necessarily fails if the full set of moment conditions is 
used, and correspondingly the efficient GMM estimator is not an 
instrumental variables estimator. 


The set of non-redundant moment conditions can sometimes
be very large. For example, this is true in the dynamic model, and
also in simpler static models if the assumption of no conditional
heteroskedasticity fails. In such cases the finite sample properties
of the GMM estimator using the full set of moment conditions may
be poor. An important avenue of research is to find estimators
which are efficient, or nearly so, and yet have better finite sample
properties than the full GMM estimator.

Chapter 9

ALTERNATIVE GMM METHODS FOR
NONLINEAR PANEL DATA MODELS

Jörg Breitung and Michael Lechner


In recent years the GMM approach has become increasingly popular
for the analysis of panel data (e.g., Avery, Hansen and Hotz [1983],
Arellano and Bond [1991], Keane [1989], Lechner and Breitung
[1996]). Combining popular nonlinear models used in microeconometric
applications with typical panel data features like an error
component structure yields complex models which are too complicated
or even intractable to be estimated by maximum likelihood.
In such cases the GMM approach is an attractive alternative.


A well known example is the probit model, which is one of
the workhorses whenever models with binary dependent variables
are analyzed. Although the nonrobustness of the probit
estimates to the model’s tight statistical assumptions is widely
acknowledged, the ease of computation of the maximum likelihood
estimator (MLE), combined with the availability of specification
tests, makes it an attractive choice for many empirical studies based
on cross sectional data. The panel data version of the probit model
allows for serial correlation of the errors in the latent equations.
The problem with these types of specifications is, however, that the
MLE becomes much more complicated than in the case of uncorrelated
errors.


Two ways to deal with this sort of general problem have
emerged in the literature. One is simulated maximum likelihood
estimation (SMLE). The idea of this technique is to find an
estimator that only approximates the MLE but retains the asymptotic
efficiency property of the exact MLE. SMLE uses stochastic
simulation procedures to obtain approximate choice probabilities
(see e.g., Börsch-Supan and Hajivassiliou [1993], or Hajivassiliou,
McFadden and Ruud [1996]). The problem with these methods
is that they can be very computer intensive, and it may still be
difficult to estimate all parameters of the covariance matrix jointly
with the regression coefficients.


An alternative approach sacrifices some of the asymptotic ef- 
ficiency in order to obtain a simple GMM estimator. Since GMM 
estimators are consistent in the case of serially correlated errors, 
it is then not necessary to obtain joint estimates of the covariance 
parameters and the regression coefficients. These estimators are 
based on the fact that a panel probit model implies a simple probit 
model when taking each period separately. Therefore, simple mo- 
ment conditions can be derived from the individual cross sections 
and asymptotic theory can be used to minimize the efficiency loss 
implied by such a procedure. Examples of this kind of estimator
can be found in Avery et al. [1983] and Chamberlain [1980], [1984].


In a number of recent papers various other GMM estimators 
based on these ideas are suggested and compared asymptotically 
and by means of Monte Carlo simulations (Breitung and Lechner 
[1996], Bertschek and Lechner [1998], and Lechner and Breitung 
[1996]). The results of these studies are quite promising. The 
appropriate GMM estimator provides an estimation procedure that 
is robust, flexible, easy and fast to compute, and results in a small
(in some cases negligible) efficiency loss compared to full information
maximum likelihood.


In this chapter we review some earlier work on the GMM
estimation of nonlinear panel data models and suggest some new
estimators as well. In particular we consider the joint estimation
of mean and covariance parameters. Most of the previous studies
focused on first order moment conditions, i.e., restrictions on the
conditional mean, while here we will consider restrictions on higher
moments as well.


The chapter is organized as follows. Section 9.1 defines the
nonlinear panel data model and gives some examples. Section 9.2
sketches the earlier work concerning the GMM estimation of parameters
for the conditional mean. Section 9.3 considers higher order
moments for estimating the complete parameter vector and Section
9.4 applies the new approach of Gallant and Tauchen [1996] to select
appropriate moment conditions. A minimum distance version of the
resulting estimator is suggested in Section 9.5. Section 9.6 presents
the results of a small Monte Carlo experiment and in Section 9.7 the
estimation procedures are applied to an empirical example. Section
9.8 concludes.


9.1 A Class of Nonlinear Panel Data 
Models 


Let $y_{it}$ be an $m \times 1$ vector of jointly dependent variables and $x_{it}$
a $k \times 1$ vector of independent variables. The indices $i = 1, \ldots, N$
and $t = 1, \ldots, T$ indicate the cross section unit and the time period
of the observation. It is convenient to stack the observations into
matrices such that $Y_i = [y_{i1}, \ldots, y_{iT}]'$ and $X_i = [x_{i1}, \ldots, x_{iT}]'$. Next
we consider the asymptotic properties for $T$ fixed and $N \to \infty$.

For the time period $t$, the nonlinear model is characterized
by its conditional density function $h(y_{it}|x_{it}; \theta_0)$ (cf. Gouriéroux
[1996]). Defining the conditional mean function as $E(y_{it}|x_{it}) =
\mu_1(x_{it}; \theta_0)$, the model may be rewritten as

$$y_{it} = \mu_1(x_{it}; \theta_0) + v_{it},$$ (9-1)

where $E(v_{it}|x_{it}) = 0$ and $E(v_{it}v_{js}'|x_{it}, x_{js}) = 0$ for $i \neq j$.


EXAMPLE 9.1 Binary Choice Model

Let $y_{it}^* = x_{it}'\beta_0 + \varepsilon_{it}$, where $\varepsilon_i = [\varepsilon_{i1}, \ldots, \varepsilon_{iT}]'$ is i.i.d. and
$\varepsilon_{it}$ has a constant marginal distribution function $F_\varepsilon(z)$. If
$y_{it}^* > 0$, we observe $y_{it} = 1$. Otherwise we have $y_{it} = 0$.
The conditional mean function for this model is given by

$$E(y_{it}|x_{it}) = \mu_1(x_{it}; \theta_0) = F_\varepsilon(x_{it}'\beta_0).$$

EXAMPLE 9.2 Multinomial Logit Model

Let $P_{j,it}$ denote the probability of choosing alternative $j =
0, 1, \ldots, J$ given by

$$P_{j,it} = \frac{\exp(x_{it}'\beta_0^j)}{1 + \sum_{l=1}^{J} \exp(x_{it}'\beta_0^l)},$$

where $\beta_0^0 = 0$. We define $J$ indicator variables¹ $y_{j,it}$ ($j =
1, \ldots, J$) which take a value of one if alternative $j$ is chosen
by individual $i$ at time period $t$ and zero otherwise. The
conditional mean function for $y_{it} = [y_{1,it}, \ldots, y_{J,it}]'$ is

$$E(y_{it}|x_{it}) = \mu_1(x_{it}; \beta_0) = \begin{pmatrix} P_{1,it} \\ \vdots \\ P_{J,it} \end{pmatrix}.$$
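The conditional mean of the multinomial logit example is easy to evaluate directly. The following sketch computes the choice probabilities with the base alternative normalised to a zero coefficient vector. The function name and the numerical values are assumptions for illustration.

```python
import math

def mnl_probabilities(x, betas):
    """Choice probabilities for alternatives 0..J; betas lists beta^1..beta^J, beta^0 = 0."""
    scores = [0.0] + [sum(xk * bk for xk, bk in zip(x, b)) for b in betas]
    m = max(scores)                     # subtract the max for numerical stability
    expo = [math.exp(s - m) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]

x_it = [1.0, 0.5]                       # regressors for one (i, t) pair (assumed)
betas = [[0.2, -1.0], [0.0, 0.3]]       # J = 2 coefficient vectors (assumed)
p = mnl_probabilities(x_it, betas)
mean_y = p[1:]                          # conditional mean of (y_1,it, ..., y_J,it)
print(p, mean_y)
```

Dropping the first element reflects the redundancy of the moment condition for the base alternative: the probabilities sum to one, so $P_{0,it}$ carries no additional information.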


EXAMPLE 9.3 Poisson Model

Let $y_{it}$ be an integer-valued random variable drawn from a
Poisson distribution with conditional mean function

$$\mu_1(x_{it}; \theta_0) = \exp(x_{it}'\beta_0).$$

In the examples we give the conditional mean functions for an
individual $i$ at time $t$. Stacking the means of different time periods
into a $T \times m$ matrix gives

$$\mu_1(X_i; \theta_0) = \begin{pmatrix} \mu_1(x_{i1}; \theta_0)' \\ \vdots \\ \mu_1(x_{iT}; \theta_0)' \end{pmatrix}.$$

We may include lagged or lead values of the exogenous variables in
$x_{it}$, so that the conditional mean may depend on the exogenous variables
from other time periods. Moreover, we may include weakly
exogenous variables in the sense that $x_{it}$ is uncorrelated with $v_{it}$
but may be correlated with $v_{is}$ for $s \neq t$. Accordingly, the model
may include weakly or strongly exogenous variables. However, it
is important to notice that the mean function is the same for all
cross section units. This excludes many models with heterogeneity
in the mean such as fixed effects models.²

1 Note that the moment condition for $y_{0,it}$ is redundant because the
probabilities of the choices sum up to one.


Unobserved heterogeneity may be represented by including an
individual specific random effect $\alpha_i$. For example, a random component
version of the binary choice model may be constructed as
in the following example.

EXAMPLE 9.4 Error Components Probit Model

Let $y_{it}^*$ be a latent variable given by

$$y_{it}^* = x_{it}'\beta_0 + \alpha_i + \varepsilon_{it},$$

where $x_{it}$ is a vector of strongly exogenous variables, $\alpha_i$ and
$\varepsilon_{it}$ are mutually and serially uncorrelated random variables
distributed as $\alpha_i \sim N(0, \sigma_\alpha^2)$ and $\varepsilon_{it} \sim N(0, \sigma_\varepsilon^2)$, where we
normalize the variances as $\sigma_\alpha^2 + \sigma_\varepsilon^2 = 1$. The observed
dependent variable $y_{it}$ is one if $y_{it}^* > 0$ and zero otherwise.
The parameter vector of the model is $\theta_0 = [\beta_0', \rho_0]'$, where
$\rho_0 = \sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_\varepsilon^2)$ denotes the serial correlation and the
mean function is identical to the one of Example 9.1, where
$F_\varepsilon(z)$ is the standard normal c.d.f.

This error component probit model is a special case of Example 9.1
and will serve as the leading example in what follows.

2 The reason is that in nonlinear models it is not easy to deal with
heterogeneity in the mean. For discrete choice models there are some special
cases allowing for fixed effects estimation such as count data models and
conditional binary logit models. If the latent dependent variable is partly
observable, as in censored or truncated regression models, semiparametric
methods may be an attractive alternative (cf. Honoré [1992], [1993]).

9.2 GMM Estimators for the
Conditional Mean


Assume that we are interested in the conditional mean function
given by $\mu_1(X_i; \theta)$. In many applications, the mean function does
not depend on the complete parameter vector. In this case we may
write $\mu_1(X_i; \theta) = \mu_1(X_i; \theta^{(1)})$, where $\theta^{(1)}$ is a $p_1 \times 1$ subvector of $\theta$.
The remaining parameters are treated as nuisance parameters. For
example, in the error components probit model given in Example
9.4 we may be interested in $\beta_0$ but not in the correlation coefficient
$\rho_0$. Accordingly, it is convenient to focus on the conditional
mean when constructing a GMM estimator. We define first order
moments as

$$f_1(Y_i, X_i; \theta^{(1)}) = Y_i - \mu_1(X_i; \theta^{(1)})$$ (9-2)

with the moment condition

$$E[f_1(Y_i, X_i; \theta_0^{(1)}) | X_i] = 0.$$ (9-3)

Using the law of iterated expectations the conditional moment
restrictions can be expressed as a set of unconditional moment
restrictions

$$E[B(X_i) f_1(Y_i, X_i; \theta_0^{(1)})] = 0,$$ (9-4)

where $B(X_i)$ is a $p_1 \times T$ matrix of functions of $X_i$. The optimal
choice of $B(X_i)$ is (cf. Newey [1993])

$$B^*(X_i) = C \cdot D(X_i; \theta_0^{(1)})' \Omega(X_i; \theta_0)^{-1},$$ (9-5)

where

$$D(X_i; \theta^{(1)}) = E[\partial f_1(Y_i, X_i; \theta^{(1)})/\partial\theta^{(1)\prime} \,|\, X_i],$$
$$\Omega(X_i; \theta) = E[f_1(Y_i, X_i; \theta^{(1)}) f_1(Y_i, X_i; \theta^{(1)})' \,|\, X_i]$$

and $C$ is some nonsingular square matrix.

For the panel probit model (Example 9.4) we obtain:

$$D(X_i; \beta) = \begin{pmatrix} -\phi(x_{i1}'\beta)x_{i1}' \\ \vdots \\ -\phi(x_{iT}'\beta)x_{iT}' \end{pmatrix},$$

$$\Omega(X_i; \beta, \rho) = (\omega_{ts,i}),$$ (9-6)

$$\omega_{ts,i}(x_{it}, x_{is}; \beta, \rho) = \begin{cases} \Phi(x_{it}'\beta) - \Phi(x_{it}'\beta)^2 & \text{for } t = s \\ \Phi^{(2)}(x_{it}'\beta, x_{is}'\beta, \rho) - \Phi(x_{it}'\beta)\Phi(x_{is}'\beta) & \text{otherwise,} \end{cases}$$

where $\phi(\cdot)$ denotes the p.d.f. of the univariate standard normal
distribution, $\Phi(\cdot)$ its c.d.f., and $\Phi^{(2)}(\cdot, \cdot, \rho)$ indicates a bivariate normal
c.d.f. with correlation $\rho$.
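To make the structure of $\Omega(X_i; \beta, \rho)$ concrete, the sketch below evaluates its elements for given index values $x_{it}'\beta$. It uses only the standard library: the bivariate normal c.d.f. is obtained by one-dimensional Simpson integration of $\phi(x)\,\Phi((b - \rho x)/\sqrt{1 - \rho^2})$, which is a simple stand-in and not the numerical method used by the authors. All function names and numbers are illustrative assumptions.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bvn_cdf(a, b, rho, n=2000):
    """P(Z1 <= a, Z2 <= b) for a standard bivariate normal with |rho| < 1 (Simpson's rule)."""
    lo = -8.0                           # effectively minus infinity
    if a <= lo:
        return 0.0
    h = (a - lo) / n                    # n must be even
    c = math.sqrt(1.0 - rho * rho)
    g = lambda x: norm_pdf(x) * norm_cdf((b - rho * x) / c)
    total = g(lo) + g(a)
    for j in range(1, n):
        total += (4 if j % 2 else 2) * g(lo + j * h)
    return total * h / 3.0

def omega(xb, rho):
    """T x T matrix (omega_ts) for index values xb[t] = x_it' beta."""
    T = len(xb)
    P = [norm_cdf(v) for v in xb]
    return [[P[t] - P[t] ** 2 if t == s
             else bvn_cdf(xb[t], xb[s], rho) - P[t] * P[s]
             for s in range(T)] for t in range(T)]

xb = [0.3, -0.2, 1.0]                   # hypothetical index values, T = 3
Omega_i = omega(xb, rho=0.4)
for row in Omega_i:
    print([round(v, 4) for v in row])
```

Setting `rho=0` makes every off-diagonal element vanish, which corresponds to the simple but inefficient choice of fixing $\rho$ at zero.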


It is important to note that $D(X_i; \beta)$ does not depend on the
covariance parameter $\rho$, whereas the computation of $\omega_{ts,i}$ requires the
value of $\rho$. Accordingly, for the asymptotically optimal instruments
we need estimates of the mean and the covariance parameters.
There are several strategies to deal with this problem. First, $\rho$
may be fixed at some arbitrary level, say $\rho = 0$. This gives a
computationally simple estimator that is consistent but inefficient. The
second possibility is to apply some rough approximation such as the
“small sigma approximation” (cf. Breitung and Lechner [1996]).
Again, an efficiency loss may result from this approximation.

The third possibility is to estimate $T(T-1)/2$ bivariate probit
models to obtain direct estimates. This approach is asymptotically
efficient, but quite time consuming. In particular, convergence
problems may occur, as often happens with fairly small
samples. Finally, Bertschek and Lechner [1998] suggest the use of
nonparametric methods (e.g., the $k$-nearest neighbor method) to
avoid the estimation of the correlation coefficient. Based on an
empirical application and an extensive Monte Carlo study they find
that the nonparametric GMM estimator approaches the efficiency
of the MLE.


Another possibility is to use a high dimensional simple function
such as

$$B(X_i) = I_T \otimes \mathrm{vec}(X_i)$$

or

$$B(X_i) = \begin{pmatrix}
x_{i1}/s_{i1} & 0 & \cdots & 0 \\
0 & x_{i2}/s_{i2} & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & x_{iT}/s_{iT}
\end{pmatrix},$$

where $s_{it} = \Phi(x_{it}'\beta)[1 - \Phi(x_{it}'\beta)]$ (see Avery et al. [1983], Breitung
and Lechner [1996]). The main problem with this approach is
that a large number of moment conditions is needed to approach
the efficient GMM estimator. In finite samples, however, GMM
estimators perform poorly if the number of moment conditions gets
large relative to the sample size (e.g., Breitung and Lechner [1996]).

9.3 Higher Order Moment Conditions


So far we confine ourselves to the moment restrictions for the condi- 
tional mean. Accordingly, this approach does not provide estimates 
for other parameters like the conditional variance. Moreover, the 
efficiency of the GMM procedure may be improved by considering 
higher order moment conditions. If y; is univariate (as for the 
cross section probit model, for example) we may define moments of 
degree k as 


Fa(yi, 239) = [ys — p(z; 0)]* — p(x; 9) 
where up = E{[y; — (Xi; 6)]*|z;}. The corresponding moment 
condition is 
El fr (Yi, £i; 90)|ei] = 0 . (9-7) 
If $y_i$ is a vector, all relevant moments are stacked into an appropriate 
vector. For the panel probit model the typical element for a 
vector with $k = 2$ is given by 

$$f_2(y_i, X_i; \beta, \rho) = [y_{it} - \Phi(x_{it}'\beta)][y_{is} - \Phi(x_{is}'\beta)] - \omega_{ts,i}(x_{it}, x_{is}; \beta, \rho), \tag{9-8}$$

where $\omega_{ts,i}$ is defined as in (9-6). For $t \neq s$, the moment condition 
requires the evaluation of the bivariate normal c.d.f. and is, therefore, 
more complicated than imposing just the conditional mean 
restrictions. 
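
As a rough illustration of (9-8), the following Python sketch evaluates one second-order moment contribution for a pair of periods. It assumes that $\omega_{ts,i}$ equals the conditional covariance $\Phi_2(x_{it}'\beta, x_{is}'\beta; \rho) - \Phi(x_{it}'\beta)\Phi(x_{is}'\beta)$, where $\Phi_2$ is the standard bivariate normal c.d.f. The function names and the Simpson-rule integration are choices made for this sketch only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bvn_cdf(a, b, rho, n=400):
    """Standard bivariate normal c.d.f. P(Z1 <= a, Z2 <= b) with correlation rho.

    Uses P = int_{-inf}^{a} phi(z) * Phi((b - rho*z) / sqrt(1 - rho^2)) dz,
    approximated by Simpson's rule on [-8, a] with n (even) subintervals.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    lower = -8.0  # the normal density is negligible below this point
    if a <= lower:
        return 0.0
    scale = sqrt(1.0 - rho * rho)

    def integrand(z):
        return STD_NORMAL.pdf(z) * STD_NORMAL.cdf((b - rho * z) / scale)

    h = (a - lower) / n
    total = integrand(lower) + integrand(a)
    for j in range(1, n):
        total += (4 if j % 2 else 2) * integrand(lower + j * h)
    return total * h / 3.0


def second_moment(y_t, y_s, index_t, index_s, rho):
    """Product of the two generalized residuals minus the assumed omega term."""
    p_t = STD_NORMAL.cdf(index_t)
    p_s = STD_NORMAL.cdf(index_s)
    omega = bvn_cdf(index_t, index_s, rho) - p_t * p_s  # assumed form of omega_{ts,i}
    return (y_t - p_t) * (y_s - p_s) - omega
```

Under this assumption the moment has expectation zero at the true parameters, which is the property the GMM procedure exploits; the price is one bivariate c.d.f. evaluation per observation and pair of periods.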


Intuitively, as we include an increasing number of moment 
conditions, the moment conditions will give an accurate charac- 
terization of the conditional distribution. Therefore, the GMM 
estimator will tend to the MLE. There are, however, serious prob- 
lems with such an approach. First, higher order moment conditions 
easily become quite complicated so that the GMM procedure may 
even be more burdensome than the corresponding MLE. Second, 
the number of moment conditions increases rapidly with k. It is 
therefore desirable to have an alternative method for generating 
moment conditions rather than considering higher order conditional 
moments. 
9.4 Selecting Moment Conditions: 
The Gallant-Tauchen Approach 


Gallant and Tauchen [1996] suggest using an auxiliary (possibly 
misspecified) model to generate moment conditions from the scores 
of the pseudo MLE. The idea behind this approach is that the scores 
of an accurate representation of the main features of the model (the 
score generator) may provide efficient moment conditions. In fact, 
if the auxiliary model “smoothly embeds” the structural model, then 
the GMM estimator derived from the score generator is asymptotically 
efficient (Gallant and Tauchen [1996]). 


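Consider the panel probit model given in Example 9.4. Using 
a linear error component model $y_{it} = x_{it}'\gamma + \alpha_i + \varepsilon_{it}$ as auxiliary 
model gives rise to the following moments:³ 

$$s_{N,1}(\lambda) = \frac{1}{\sigma_\varepsilon^2 NT} \sum_{t=1}^{T} \sum_{i=1}^{N} (u_{it} - \rho\,\bar u_i)\, x_{it} \tag{9-9}$$

$$s_{N,2}(\lambda) = \frac{1}{N} \sum_{i=1}^{N} \bar u_i^2 - \sigma_\alpha^2 - \frac{1}{T}\sigma_\varepsilon^2 \tag{9-10}$$

$$s_{N,3}(\lambda) = \frac{1}{N(T-1)} \sum_{t=1}^{T} \sum_{i=1}^{N} (u_{it} - \bar u_i)^2 - \sigma_\varepsilon^2 , \tag{9-11}$$

where $\lambda = [\gamma', \sigma_\alpha^2, \sigma_\varepsilon^2]'$, $u_{it} = y_{it} - x_{it}'\gamma$, $\bar u_i = T^{-1}\sum_t u_{it}$, and 
$\rho = T\sigma_\alpha^2/(T\sigma_\alpha^2 + \sigma_\varepsilon^2)$. It is important to note that the variance 
of the errors in a probit model is not identified and is therefore set 
to one. As a consequence the moment $s_{N,3}(\lambda)$ is dropped to obtain 
a nonsingular covariance matrix of the conditional moments. 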
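
The two retained scores can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a single regressor, a first score proportional to $\sum_{t}\sum_{i}(u_{it} - \rho\bar u_i)x_{it}$, and a second score of the form implied by $E(\bar u_i^2) = \sigma_\alpha^2 + \sigma_\varepsilon^2/T$. The function and argument names are hypothetical.

```python
def aux_scores(y, x, gamma, s2_alpha, s2_eps):
    """Scores s_{N,1} and s_{N,2} of a linear error component model (assumed forms).

    y[i][t] and x[i][t] hold the outcome and a single regressor for unit i
    in period t; gamma is the slope, s2_alpha and s2_eps are the variances
    of the individual effect and of the idiosyncratic error.
    """
    n_units, n_periods = len(y), len(y[0])
    rho = n_periods * s2_alpha / (n_periods * s2_alpha + s2_eps)
    score_gamma = 0.0
    mean_sq = 0.0
    for y_i, x_i in zip(y, x):
        u_i = [y_it - x_it * gamma for y_it, x_it in zip(y_i, x_i)]
        u_bar = sum(u_i) / n_periods
        # quasi-demeaned residuals interacted with the regressor
        score_gamma += sum((u_it - rho * u_bar) * x_it for u_it, x_it in zip(u_i, x_i))
        mean_sq += u_bar ** 2
    score_gamma /= s2_eps * n_units * n_periods
    # between-unit variation compared with its assumed expectation
    score_alpha = mean_sq / n_units - s2_alpha - s2_eps / n_periods
    return score_gamma, score_alpha
```

Applied to binary outcomes, the residuals $u_{it}$ come from a linear probability fit, which is exactly the sense in which the auxiliary model is misspecified.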


Although the linear model is misspecified, it may approximate 
the crucial features of the underlying nonlinear model. Whenever it 
is possible to derive the relationship between the parameters of the 
structural and auxiliary model, we are able to compute estimates 
for the structural model from the estimated auxiliary model. 


Let $\theta$ denote the vector of structural parameters (i.e., $\theta = 
[\beta', \rho]'$ in our example) and 


$$f_N(\theta, \lambda) = E_\theta[s_N(\lambda)], \tag{9-12}$$


3 These expressions are derived from the scores given in Hsiao [1986, p. 39]. 
denotes the expected scores, where 

$$s_N(\lambda) = [s_{N,1}(\lambda)', s_{N,2}(\lambda), s_{N,3}(\lambda)]'$$

and $E_\theta$ indicates the expectation with respect to the structural 
model given the parameter vector $\theta$. Then, Gallant and Tauchen 
[1996] suggest estimating $\theta$ by minimizing the objective function 

$$\hat\theta_{GT} = \arg\min_\theta \{ f_N(\theta, \hat\lambda)' A_N f_N(\theta, \hat\lambda) \}, \tag{9-13}$$

where $\hat\lambda$ is the pseudo MLE given by $s_N(\hat\lambda) = 0$ and $A_N$ tends to the 
optimal weight matrix as $N \to \infty$. If the dimension of $f_N(\theta, \hat\lambda)$ is 
the same as the number of structural parameters, we can compute 
$\hat\theta_{GT}$ by solving $f_N(\hat\theta_{GT}, \hat\lambda) = 0$. 


In our example, the moments $f_N(\theta, \hat\lambda)$ can be computed by 
using the bivariate normal c.d.f. (see Section 9.2). Another pos- 
sibility, which is particularly useful for more complicated models, 
is to approximate (9-12) using Monte Carlo techniques (see, e.g., 
Gouriéroux and Monfort [1993] and Gallant and Tauchen [1996]). 
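
The Monte Carlo approximation of the expected scores can be illustrated with a deliberately small toy problem. The sketch below does not use the panel probit model of the text: it assumes a cross-section model $y_i = 1(\theta x_i + e_i > 0)$ with standard normal $e_i$ as the structural model and a linear regression without intercept as the score generator. All names are hypothetical, and the just-identified estimator is obtained by bisection rather than by a general GMM optimizer.

```python
import random
from statistics import NormalDist

PHI = NormalDist().cdf


def simulate_y(theta, x, shocks):
    """Structural toy model: y_i = 1(theta * x_i + e_i > 0)."""
    return [1.0 if theta * x_i + e_i > 0 else 0.0 for x_i, e_i in zip(x, shocks)]


def aux_score(lam, y, x):
    """Score of the auxiliary linear model y_i = lam * x_i + error."""
    return sum(x_i * (y_i - lam * x_i) for x_i, y_i in zip(x, y)) / len(x)


def pseudo_mle(y, x):
    """Value of lam that sets the auxiliary score to zero (OLS slope)."""
    return sum(x_i * y_i for x_i, y_i in zip(x, y)) / sum(x_i * x_i for x_i in x)


def expected_score_sim(theta, lam, x, shock_sets):
    """Monte Carlo approximation of E_theta[s_N(lam)] using fixed shocks."""
    draws = [aux_score(lam, simulate_y(theta, x, e), x) for e in shock_sets]
    return sum(draws) / len(draws)


def expected_score_exact(theta, lam, x):
    """Analytical counterpart, available here since E(y_i | x_i) = Phi(theta * x_i)."""
    return sum(x_i * (PHI(theta * x_i) - lam * x_i) for x_i in x) / len(x)


def gt_estimate(lam_hat, x, shock_sets, lo=-5.0, hi=5.0, iterations=30):
    """Solve f_N(theta, lam_hat) = 0 by bisection.

    With fixed shocks the simulated score is nondecreasing in theta,
    so bisection locates the sign change of the step function.
    """
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if expected_score_sim(mid, lam_hat, x, shock_sets) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def demo(theta0=1.0, n_obs=1000, n_sim=20, seed=1):
    """Generate toy data, fit the auxiliary model and return the estimate."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    y = simulate_y(theta0, x, [rng.gauss(0.0, 1.0) for _ in range(n_obs)])
    shock_sets = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_sim)]
    lam_hat = pseudo_mle(y, x)
    return gt_estimate(lam_hat, x, shock_sets), lam_hat, x, shock_sets
```

Holding the simulated shocks fixed across values of $\theta$ keeps the simulated objective from jumping around between evaluations, although it remains a step function of $\theta$.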


To compute the GMM estimator we need (estimates for) the 
derivatives 

$$F_N(\theta) = \frac{\partial E_\theta\{f_N(\theta, \hat\lambda)\}}{\partial \theta'} , \tag{9-14}$$

which is a quite complicated function in general. It is important 
to notice that in (9-12) the expectation is computed by treating $\hat\lambda$ 
as given while in (9-14) $\hat\lambda$ is a random variable depending on $\theta$. In 
practice, this expression is estimated using simulation methods. 


Usually, the variance of the GMM estimator with optimal 
weight matrix is estimated as 

$$\hat V_N = [F_N(\hat\theta)' A_N F_N(\hat\theta)]^{-1} .$$

However, for the consistency of this estimator it is required that the 
auxiliary model is correctly specified so that there exists a value $\lambda^*$ 
such that the “score contributions” satisfy 

$$E_\theta[s(y_i, X_i; \lambda^*)] = 0 \quad \text{for all } i = 1, \ldots, N,$$

where $s_N(\lambda) = \sum_i s(y_i, X_i; \lambda)$. For misspecified auxiliary models 
we therefore need to apply a different estimator for the covariance 
matrix of $\hat\theta$ given by 

$$\tilde V_N = [F_N(\hat\theta)' A_N F_N(\hat\theta)]^{-1} F_N(\hat\theta)' A_N \tilde\Omega_N A_N F_N(\hat\theta) [F_N(\hat\theta)' A_N F_N(\hat\theta)]^{-1}$$

where 

$$\tilde\Omega_N = \frac{1}{N} \sum_{i=1}^{N} [s(y_i, X_i; \hat\lambda) - m_i(\hat\lambda)][s(y_i, X_i; \hat\lambda) - m_i(\hat\lambda)]' \tag{9-15}$$

and 

$$m_i(\lambda) = E_\theta[s(y_i, X_i; \lambda)] .$$


Obviously, this modification, suggested by Gallant and Tauchen 
[1996], implies a considerable increase in computational burden compared 
with the conventional estimator. 


In sum, although the approach suggested by Gallant and 
Tauchen [1996] is attractive for selecting suitable moment conditions, 
it implies a great deal of computational effort. Hence, this 
approach is not recommended for the rather simple models considered 
here for ease of exposition. If the model becomes much more 
complex and no convenient expressions for the first moments are 
available, the Gallant-Tauchen approach seems to be an attractive 
device to select useful moments for the GMM procedure. 


9.5 A Minimum Distance Approach 


It is possible to construct a computationally more convenient min- 
imum distance procedure with the same asymptotic properties as 
the Gallant-Tauchen estimator. This approach was suggested by 
Gouriéroux, Monfort and Renault [1993]. 


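Applying a Taylor series expansion to the moments derived 
from the auxiliary model gives 

$$\begin{aligned}
f_N(\theta, \hat\lambda) &= E_\theta[s_N(\hat\lambda)] \\
 &= E_\theta[s_N(\lambda^*)] + H_N(\lambda^*)(\hat\lambda - \lambda^*) + o_p(T^{-1/2}) \\
 &= H_N(\lambda^*)(\hat\lambda - \lambda^*) + o_p(T^{-1/2})
\end{aligned}$$

where $H_N(\lambda) = \partial E_\theta[s_N(\lambda)]/\partial\lambda'$ and $\lambda^*$ is the value of $\lambda$ such 
that $E_\theta[s_N(\lambda^*)] = 0$. As the sample size tends to infinity, $\lambda^*$ 
converges to the “pseudo-true value” defined in White [1982]. If 
$H_N(\lambda^*)$ is a regular matrix, then minimizing $f_N(\theta, \hat\lambda)' A_N f_N(\theta, \hat\lambda)$ 
is asymptotically equivalent to 

$$\hat\theta_{MD} = \arg\min_\theta \{ (\hat\lambda - \lambda^*)' V_\lambda^{-1} (\hat\lambda - \lambda^*) \},$$

where $V_\lambda^{-1} = H_N(\lambda^*)' A_N H_N(\lambda^*)$ and $V_\lambda$ tends to the covariance matrix of 
$\hat\lambda - \lambda^*$. For convenience the notation does not make explicit the 
dependence of $\lambda^*$ on $\theta$. 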
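
The step from the score-based objective to the minimum distance objective can be written out explicitly. The lines below are a sketch under standard regularity assumptions; the symbol $S$ for the covariance matrix of the scores is introduced here only for illustration.

```latex
\begin{aligned}
f_N(\theta,\hat\lambda)' A_N f_N(\theta,\hat\lambda)
  &\approx [H_N(\lambda^*)(\hat\lambda-\lambda^*)]' A_N [H_N(\lambda^*)(\hat\lambda-\lambda^*)] \\
  &= (\hat\lambda-\lambda^*)' \, H_N(\lambda^*)' A_N H_N(\lambda^*) \, (\hat\lambda-\lambda^*) .
\end{aligned}

% Assumption: A_N^{-1} estimates S = Var[s_N(\lambda^*)], and
% \hat\lambda - \lambda^* \approx -H_N(\lambda^*)^{-1} s_N(\lambda^*).
\operatorname{Var}(\hat\lambda-\lambda^*) \approx H_N^{-1} S \, (H_N')^{-1}
\quad\Longrightarrow\quad
H_N' S^{-1} H_N = [\operatorname{Var}(\hat\lambda-\lambda^*)]^{-1} .
```

With the optimal weight matrix, the quadratic form in the scores therefore weights the distance $\hat\lambda - \lambda^*$ by the inverse of its own covariance matrix, which is the usual optimal minimum distance weighting.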

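For computing $\hat\theta_{MD}$ the derivative 

$$D_\lambda(\theta) = \frac{\partial \lambda^*}{\partial \theta'}$$

needs to be evaluated. Usually, this derivative is much easier to 
compute than the derivative $F_N(\theta)$ needed to compute the Gallant-Tauchen 
estimator. Since $E_\theta(\hat\lambda)$ tends to $\lambda^*$ as $N \to \infty$, this 
derivative is asymptotically equivalent to 

$$\tilde D_\lambda(\theta) = \frac{\partial E_\theta(\hat\lambda)}{\partial \theta'} ,$$

which can easily be estimated by simulation techniques. 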


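A second important advantage of this variant of the Gallant-Tauchen 
approach is that the covariance matrix $\Sigma_{MD} = E[(\hat\theta_{MD} - 
\theta)(\hat\theta_{MD} - \theta)']$ can be estimated as 

$$\hat\Sigma_{MD} = [D_\lambda(\hat\theta)' \hat\Sigma_\lambda^{-1} D_\lambda(\hat\theta)]^{-1}$$

where $\hat\Sigma_\lambda = E_\theta[(\hat\lambda - \lambda^*)(\hat\lambda - \lambda^*)']$, which is much easier to estimate 
than the expression (9-15). 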


9.6 Finite Sample Properties 


9.6.1 Data Generating Process 


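To compare the small sample properties of different GMM estimators, 
we simulate data according to the following model: 

$$\begin{aligned}
y_{it} &= 1(\beta^C + \beta^D x_{it}^D + \beta^N x_{it}^N + u_{it} > 0) \\
x_{it}^D &= 1(\tilde x_{it}^D > 0), \quad P(\tilde x_{it}^D > 0) = 0.5 \\
x_{it}^N &= 0.5\, x_{i,t-1}^N + 0.05\, t + \eta_{it}, \quad \eta_{it} \sim U[-1, 1] \\
u_{it} &= \delta c_i + \varepsilon_{it}, \quad c_i \sim N(0, 1) \\
\varepsilon_{it} &= \alpha\, \varepsilon_{i,t-1} + \sigma \tilde\varepsilon_{it}, \quad \tilde\varepsilon_{it} \sim N(0, 1) \\
& i = 1, \ldots, N, \quad t = 1, \ldots, T,
\end{aligned}$$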
which is also used in Breitung and Lechner [1996] and Bertschek 
and Lechner [1998]. The parameters $(\beta^C, \beta^D, \beta^N, \delta, \alpha, \sigma)$ are fixed 
coefficients and $1(\cdot)$ is an indicator function, which is one if its 
argument is true and zero otherwise. The parameter $\sigma$ is chosen 
such that the variance of $u_{it}$ is unity. All random numbers are 
drawn independently over time and individuals. The first regressor 
is a serially uncorrelated indicator variable, whereas the second 
regressor is a smooth variable with bounded support. The depen- 
dence on lagged values and on a time trend induces a correlation 
over time. This type of regressor has been suggested by Nerlove 
[1971] and was also used, for example, by Heckman [1981]. 


We set $\beta^C = -0.75$, $\beta^D = \beta^N = 1$ in all simulations and 
impose $\mathrm{var}(u_{it}) = 1$. To represent typical sample sizes encountered 
in empirical applications we let $N = 100, 400$ and $T = 5, 10$. 
Depending on the DGP, 500 or 1000 replications ($R$) were generated. 
In order to diminish the impact of initial conditions, the 
dynamic processes have been started at $t = -10$ with $x_{i,-11}^N = 
\varepsilon_{i,-11} = 0$. 
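
A minimal Python sketch of this data generating process may help to fix ideas. It assumes that $\sigma$ is computed from the unit-variance condition $\delta^2 + \sigma^2/(1 - \alpha^2) = 1$ and that the time trend is also applied during the start-up periods; the function name, argument names and the use of the standard library random number generator are choices made here, not taken from the original study.

```python
import random
from math import sqrt


def simulate_panel(N, T, beta_c=-0.75, beta_d=1.0, beta_n=1.0,
                   alpha=0.0, delta=sqrt(0.5), burn_in=10, seed=0):
    """Simulate the binary panel; returns data[i][t-1] = (y, x_d, x_n)."""
    rng = random.Random(seed)
    # sigma chosen so that var(u_it) = delta^2 + sigma^2 / (1 - alpha^2) = 1
    sigma = sqrt((1.0 - delta ** 2) * (1.0 - alpha ** 2))
    data = []
    for _ in range(N):
        c_i = rng.gauss(0.0, 1.0)       # individual effect
        x_n, eps = 0.0, 0.0             # start-up values of the dynamic processes
        rows = []
        for t in range(-burn_in, T + 1):
            x_n = 0.5 * x_n + 0.05 * t + rng.uniform(-1.0, 1.0)
            eps = alpha * eps + sigma * rng.gauss(0.0, 1.0)
            if t >= 1:                  # keep only the sample periods 1..T
                x_d = 1.0 if rng.gauss(0.0, 1.0) > 0 else 0.0
                u = delta * c_i + eps
                y = 1 if beta_c + beta_d * x_d + beta_n * x_n + u > 0 else 0
                rows.append((y, x_d, x_n))
        data.append(rows)
    return data
```

For example, `simulate_panel(100, 5)` corresponds to the pure random effects design with the smallest sample size, while passing `alpha=0.8, delta=0.2` adds the autoregressive component.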


In the simulations two different specifications are considered. 
First, a pure error component model with serially and mutually 
uncorrelated error components is obtained by setting $\alpha = 0$ and $\delta = 
\sqrt{0.5}$. Furthermore $\sigma = \sqrt{0.5}$ so that $E(u_{it}^2) = 1$. The correlation 
coefficient is $\rho = 0.5$. 


The second specification removes the equi-correlation pattern 
by setting $\alpha = 0.8$, $\delta = 0.2$ and $\sigma = 0.5$. In such a specification, 
the serial correlation is persistent but declines with an increasing 
lag length. The maximum correlation coefficient is 0.8 for a single 
lag and the correlation decreases to 0.4 for a lag length of four (the 
maximum lag length when letting $T = 5$). 
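
The quoted correlations follow from the error structure once the variance of $u_{it}$ is normalized to one: for a lag $k \geq 1$, $\mathrm{corr}(u_{it}, u_{i,t-k}) = \delta^2 + (1 - \delta^2)\alpha^k$. The short sketch below evaluates this expression. Here $\sigma$ is derived from the unit-variance condition rather than taken from the text, and the function names are illustrative.

```python
from math import sqrt


def implied_sigma(alpha, delta):
    """sigma solving delta^2 + sigma^2 / (1 - alpha^2) = 1."""
    return sqrt((1.0 - delta ** 2) * (1.0 - alpha ** 2))


def error_correlation(lag, alpha, delta):
    """corr(u_it, u_i,t-lag) for u_it = delta*c_i + eps_it with var(u_it) = 1."""
    if lag == 0:
        return 1.0
    # the random effect contributes delta^2 at every lag, the AR(1) part decays
    return delta ** 2 + (1.0 - delta ** 2) * alpha ** abs(lag)


if __name__ == "__main__":
    for lag in range(1, 5):
        print(lag, round(error_correlation(lag, alpha=0.8, delta=0.2), 3))
```

With $\alpha = 0.8$ and $\delta = 0.2$ this gives roughly 0.81 at lag one and 0.43 at lag four, in line with the rounded values 0.8 and 0.4; with $\alpha = 0$ and $\delta = \sqrt{0.5}$ the correlation is 0.5 at every lag.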


9.6.2 Estimators 


The first estimator is the MLE computed as in Butler and Moffitt 
[1982]. The number of evaluation points is set to 5 as a compro- 
mise between computational speed and numerical accuracy. The 
robust estimator according to White [1982] is used to estimate the 
standard errors. The results for this estimator are indicated by the 
acronym ML-RE. The pooled estimator denotes the ML estimator 
ignoring the panel structure of the data. The standard errors are 
estimated allowing for serial correlation as in Avery et al. [1983]. 


The infeasible GMM-IV estimator is the optimal GMM estimator 
using the conditional mean restrictions (see Section 9.2). 
To compute the optimal instrument matrix the true correlation $\rho_0$ 
is used. Bertschek and Lechner [1998] propose a feasible GMM 
estimator that estimates the unknown quantities in the expression 
for the optimal instruments by nonparametric methods. The 
version that performs best in their Monte Carlo study is labeled 
GMM-WNP. 


Another approach to obtain the optimal instruments is to get 
a consistent estimate of the unknown correlation coefficient. The 
estimator GMM-IV (param) proposed here consists of three steps. 
In the first step a pooled probit model is estimated to obtain 
consistent estimates $\hat\beta$. Then $T(T - 1)/2$ second moments as in eq. 
(9-8) are used to compute consistent estimates of the correlations. 
For given values of $\beta$, each moment condition depends only on 
the unknown correlation coefficient, which is bounded between $-1$ 
and $+1$. Hence, grid search methods are used to determine the 
correlation coefficients. For a random effects model, using one such 
moment condition is sufficient for obtaining a consistent estimate 
of $\rho$. However, our estimation procedure allows for different values 
of the $T(T - 1)/2$ correlation coefficients $\rho_{ts}$ ($t, s = 1, \ldots, T$). This 
estimator is still optimal even when the covariance structure is 
more general than the equi-correlation structure of a random effects 
model. 
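
The grid search in the second step can be sketched as follows for a single pair of periods $(t, s)$. The sketch assumes that the second moment takes the form $[y_{it} - \Phi(x_{it}'\beta)][y_{is} - \Phi(x_{is}'\beta)] - [\Phi_2(x_{it}'\beta, x_{is}'\beta; \rho_{ts}) - \Phi(x_{it}'\beta)\Phi(x_{is}'\beta)]$ and that the index values $x_{it}'\hat\beta$ from the pooled probit are already available. The grid, the function names and the numerical integration are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def bvn_cdf(a, b, rho, n=100):
    """Standard bivariate normal c.d.f. via Simpson's rule on [-8, a]."""
    lower = -8.0
    if a <= lower:
        return 0.0
    scale = sqrt(1.0 - rho * rho)

    def integrand(z):
        return STD_NORMAL.pdf(z) * STD_NORMAL.cdf((b - rho * z) / scale)

    h = (a - lower) / n
    total = integrand(lower) + integrand(a)
    for j in range(1, n):
        total += (4 if j % 2 else 2) * integrand(lower + j * h)
    return total * h / 3.0


def grid_search_rho(y_t, y_s, index_t, index_s, grid=None):
    """Grid value of rho that brings the averaged second moment closest to zero."""
    if grid is None:
        grid = [k / 20.0 for k in range(-19, 20)]  # -0.95, -0.90, ..., 0.95
    p_t = [STD_NORMAL.cdf(v) for v in index_t]
    p_s = [STD_NORMAL.cdf(v) for v in index_s]
    n_obs = len(y_t)
    # sample average of the product of generalized residuals
    sample_cov = sum((a - pa) * (b - pb)
                     for a, pa, b, pb in zip(y_t, p_t, y_s, p_s)) / n_obs
    best_rho, best_gap = None, float("inf")
    for rho in grid:
        model_cov = sum(bvn_cdf(a, b, rho) - pa * pb
                        for a, b, pa, pb in zip(index_t, index_s, p_t, p_s)) / n_obs
        gap = abs(sample_cov - model_cov)
        if gap < best_gap:
            best_rho, best_gap = rho, gap
    return best_rho
```

Because the grid is bounded away from $\pm 1$, an estimate at the edge of the grid signals the boundary problem that can arise when $N$ is small relative to the number of correlations to be estimated.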


The pooled probit estimator as well as the GMM estimators 
with asymptotically optimal instruments are consistent regardless 
of the true error correlation. Furthermore, all GMM estimators 
have the same limiting distribution whether the true or consistently 
estimated optimal instruments are used. To yield a nonparametric 
estimate of the optimal instruments, a nearest-neighbor approach 
is applied (see Bertschek and Lechner [1998] for details). The 
resulting estimator is labeled GMM-WNP. 


Following Gallant and Tauchen [1996] we employ simple auxiliary 
models as score generators. First we use a linear error components 
model with scores given in (9-9) and (9-10). To compute the 
expectation of the scores $f_N(\theta, \hat\lambda)$ we generate 100 replications of 
the model and compute the average of the scores. Using a Taylor 
expansion 

$$f_N(\theta + \delta, \hat\lambda) = f_N(\theta, \hat\lambda) + F_N(\theta)\,\delta + r, \tag{9-16}$$

where $r$ is a remainder term of order $O(\delta^2)$, the derivatives are 
estimated by a least squares regression of $f_N(\theta + \delta, \hat\lambda) - f_N(\theta, \hat\lambda)$ 
on $\delta$, where $\delta$ is a vector of normally distributed random numbers 
with zero mean and $E(\delta^2) = 0.02$. We use 100 realizations of $\delta$ 
for computing the regression lines. This regression estimator has 
the advantage that it does not suffer from problems due to the 
discontinuity and non-differentiability of the simulator. At every 
iteration step of $\theta$, the same sequence of random numbers is used. 
Using the pooled estimator with an estimate of $\rho$ based on the small-sigma 
estimate (Breitung and Lechner [1996]) as initial values, the 
algorithm usually converges after 5-10 iterations. The resulting 
estimates are labeled as GT-Score. 


Our practical experiences with such an algorithm suggest that 
the convergence properties crucially depend on the number of repli- 
cations used to estimate the conditional mean and derivatives of the 
moments. In particular, the least squares estimate of the derivatives 
requires at least 100 replications to obtain reliable estimates of the 
gradients. Furthermore, the step length $\delta$ in the Taylor expansion 
should be small enough to avoid a substantial bias but must be 
large enough to achieve acceptable properties of the least squares 
estimator. In our simulations we found that a value of $E(\delta^2) = 0.02$ 
provides a reasonable trade-off for our data generating process. For 
other processes, however, suitable values for $\delta$ or the number of 
replications may be different. 
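
The regression-based derivative estimate can be sketched in plain Python. The code below is only an illustration of the idea: it perturbs the parameter vector by normal draws with variance 0.02, regresses the change in each moment on the perturbations without an intercept, and returns the slope matrix. The moment function `f`, the helper `solve` and all other names are hypothetical, and a small Gaussian elimination routine stands in for a linear algebra library.

```python
import random


def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def regression_jacobian(f, theta, n_draws=100, var_delta=0.02, seed=0):
    """Least squares estimate of the Jacobian of f at theta.

    Regresses f(theta + delta) - f(theta) on delta for n_draws random
    perturbations delta ~ N(0, var_delta * I); row m of the result holds
    the estimated derivatives of the m-th moment.
    """
    rng = random.Random(seed)
    p = len(theta)
    sd = var_delta ** 0.5
    f0 = f(theta)
    deltas = [[rng.gauss(0.0, sd) for _ in range(p)] for _ in range(n_draws)]
    diffs = []
    for d in deltas:
        f1 = f([t + d_k for t, d_k in zip(theta, d)])
        diffs.append([a - b for a, b in zip(f1, f0)])
    # normal equations (D'D) beta = D'y, one right-hand side per moment
    dtd = [[sum(d[a] * d[b] for d in deltas) for b in range(p)] for a in range(p)]
    jacobian = []
    for m in range(len(f0)):
        dty = [sum(d[a] * df[m] for d, df in zip(deltas, diffs)) for a in range(p)]
        jacobian.append(solve(dtd, dty))
    return jacobian
```

For a simulated moment function, `f` should reuse the same underlying random numbers at every call, as described above; otherwise simulation noise enters the regression as additional error.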


Adding the scores from the (cross section) probit estimator 
applied to the pooled dataset we obtain three additional moments. 
The weight matrix is computed using the GT-Score estimator. 
The computational details for the simulations are the same as for 
the GT-Score procedure. The resulting estimator is labeled as GT-Score+. 
Since the estimator employs the instruments of the pooled 
probit estimator, it is asymptotically more efficient than the pooled 
estimator. 


For both estimators using the Gallant-Tauchen approach the 
respective minimum distance estimators are computed. Accordingly, 
we denote these estimators as GT-MD and GT-MD+. We 
use 300 Monte Carlo realizations of the model for computing the 
moments. The matrix of derivatives $D_\lambda(\theta)$ was computed by least 
squares using a Taylor expansion of $\hat\lambda$. 


To compare the performance of the estimators we compute the 
root mean square error (RMSE) and the median absolute error 
(MAE). For the estimated standard errors of the coefficient we 
compute the relative bias. The precise definitions of these measures 
are given in Table 9.1. 


9.7 Results 


Table 9.2 presents the results for the specification with pure random 
effects. It turns out that with respect to RMSE and MAE the MLE 
performs best among all estimators. The second best estimator is 
the GMM estimator based on the optimal instruments derived from 
the conditional mean restrictions (Infeasible GMM). However, this 
estimator is based on a known correlation parameter $\rho_0$ so that 
such an estimator is of limited use in practice. 


With respect to the other GMM estimators, the ranking is 
not as clear. Generally, the GMM procedures using information 
about the error correlation like the feasible versions of GMM-IV 
or the estimators using a linear error component model as a score 
generator perform better than the pooled estimator ignoring the 
error correlation altogether. 


Comparing the small sample properties of the two asymptotically 
equivalent GMM estimators GMM-IV(param) and GMM-WNP, 
the latter estimator appears to be superior for all DGPs, 
with the exception of the random effects DGP with $N = 400$ and 
$T = 5$. The potential small sample problems of these estimators 
are related to the estimation of the inverse of the conditional covariance 
matrix of the residuals for each individual. GMM-WNP 
uses nonparametric methods that perform very well even in fairly 
small samples (see Bertschek and Lechner [1998] for details). The 
problem with GMM-IV(param) is that some of the estimated $\rho_{ts}$ 
coefficients may end up at the boundary of the parameter space 
$[-1, +1]$ when $N$ is 'not large enough'. This is a particular problem 
when the number of coefficients to be estimated gets large (45 in 
the case of $T = 10$). A potential remedy is to enforce equality of 
the $\rho_{ts}$ in the second step of the estimation. There are, however, two 
drawbacks of such a procedure. First, the estimator is no longer 
asymptotically efficient when the true DGP is different from the 
random effects model. Second, the simplicity of the estimator is 
lost. Therefore, we conclude that in applications GMM-IV(param) 
may be a preferable option only if an estimate of the correlation 
structure of latent residuals is of interest and if the dimension N is 
sufficiently large relative to the dimension T. 


The Gallant-Tauchen estimators derived from a linear error 
component model seem to work well. This is perhaps surprising 
since the linear model is quite a crude approximation to the panel 
probit model. In fact, there is much room for improving the fit 
of the auxiliary model. For example, the nonlinear mean function 
and heteroskedasticity of the errors are important features which 
are neglected by the linear approximation. Nevertheless, the scores 
of the linear model obviously provide useful moment conditions to 
be exploited by a GMM procedure. 


In small samples, the original version and the minimum distance 
variant perform somewhat differently. For the smaller set 
of moment conditions the Gallant-Tauchen GMM estimator (GT-Score) 
outperforms the respective minimum distance estimator 
(GT-MD), while for the enhanced set of moment conditions the 
GT-MD+ estimator performs better than the GT-Score+ estimator. 
In all, however, the differences are small relative to the 
simulation error. 


The estimated standard errors of the coefficients are extremely 
biased for the Gallant-Tauchen estimators. The reason is that 
the standard errors are estimated assuming a correctly specified 
auxiliary model. Of course, this is not true in our application and 
it turns out that the resulting bias can be immense. Unfortunately, 
the computational effort for correcting the estimates along the lines 
suggested by Gallant and Tauchen [1996] was beyond the time 
schedule for the present work. 


The problems with the estimation of the standard errors can 
be side-stepped by using the minimum distance approach. In fact, 
the estimation of the standard errors for the GT-MD+ procedure 
seems to perform acceptably. However, the standard errors for the 
(GT-MD) estimator still possess a substantial bias. 
To study the performance of the estimators under more general 
conditions, we introduce a persistent autocorrelation in addition to 
the random effects. Obviously, such a data generating process is 
difficult to distinguish empirically in a sample with a small number 
of time periods. All GMM estimators designed for the error com- 
ponent model remain consistent in the presence of a more general 
form of serial correlation. Thus, it is interesting to know whether 
the efficiency ranking is robust to differences in the autocorrelation 
pattern. Table 9.3 presents the results for such a process. 


The conclusions from simulations with other sample sizes are 
qualitatively similar. It turns out that the relative performance of 
the estimators is roughly similar to the case of a pure random effects 
model. However, it appears that the ML-RE estimator loses most 
of its relative advantages. Furthermore, the Gallant-Tauchen type 
of estimators perform worse than the competitors based on the 
conditional mean function. On the other hand, the GMM-WNP 
estimator turns out to have the most attractive properties. It is 
simple to compute and has favorable small sample properties for 
the DGP considered here. 


9.8 An Application 


An empirical example for our discussion of panel probit models is 
the analysis of firms’ innovative activity as a response to imports 
and foreign direct investment (FDI) as considered in Bertschek 
[1995]. The main hypothesis put forward in that paper is that 
imports and inward FDI have positive effects on the innovative 
activity of domestic firms. The intuition for this effect is that 
imports and FDI represent a competitive threat to domestic firms. 
Competition on the domestic market is enhanced and the profitabil- 
ity of the domestic firms might be reduced. 


As a consequence, these firms have to produce more efficiently. 
Increasing the innovative activity is one possibility to react to 
this competitive threat and to maintain the market position. The 
dependent variable available in the data takes the value one if a 
product innovation has been realized within the last year and the 
value zero otherwise. The binary character of this variable leads us 
to formulate the model in terms of a latent variable that represents 
for instance, the firms’ unobservable expenditures for innovation 
and that is linearly related to the explanatory variables. 


The firm-level data have been collected by the Ifo-Institute, 
Munich (‘Ifo-Konjunkturtest’) and have been merged with official 
statistics from the German Statistical Yearbooks. The binary dependent 
variable indicates whether a firm reports having realized 
a product innovation within the last year or not. The independent 
variables refer to the market structure, in particular the market 
size of the industry (ln(sales)), the shares of imports and FDI in the 
supply on the domestic market (import share and FDI-share), the 
productivity as a measure of the competitiveness of the industry as 
well as two variables indicating whether a firm belongs to the raw 
materials or to the investment goods industry. Moreover, including 
the relative firm size allows us to take account of the relation between 
innovation and firm size often discussed in the literature. Hence, all 
variables with the exception of the firm size are measured at the 
industry-level (for descriptive statistics see Table 9.4). 


The estimators applied to the example include the simplest one 
(pooled with GMM standard errors), both feasible GMM estimators 
based on second order moments and the minimum distance versions 
of the Gallant-Tauchen estimator. For the latter estimator, we use 
1000 Monte Carlo replications for the simulated moments as well 
as their derivatives. Results for other estimators for that example can 
be found in Bertschek and Lechner [1998]. 


The results of the different GMM procedures are presented 
in Table 9.5. In all they are quite similar and yield the same 
conclusions. Both import share and FDI-share have positive and 
significant effects on product innovative activity. As expected under 
the Schumpeterian hypothesis that large firms are more innovative 
than small firms, the firm size variable has a positive and significant 
impact. The coefficient of productivity is significantly negative for 
pooled, GMM-IV(param) and GMM-WNP but insignificant when 
using the GT-MD. 


An interesting finding is that the estimated standard errors of 
GT-MD tend to be substantially greater than the corresponding 
estimates of the alternative estimators. We have tried different 
numbers of replications and values of $\delta$. The problem is that if $\delta$ 
is a small number, then the simulation error is large in relative 
terms, while for large values of $\delta$, the estimates of the derivative 
are biased. The standard errors presented in Table 9.5 are based 
on 10,000 replications and $\delta = 0.05 \cdot \hat\beta_P$, where $\hat\beta_P$ denotes the 
pooled estimator. We decided to choose a relative step size, because 
the parameter values appear to be quite different in magnitude. 
Repeating the computation of the standard errors using a different 
sequence of random numbers shows that these estimates reveal a 
considerable variability. Hence, these estimates do not seem very 
reliable and must be interpreted with caution. 


It is interesting to consider the serial correlation of the errors 
between different time periods (see Table 9.6). They are obtained 
as a by-product in the computation of the second step of GMM- 
IV(param) (see Section 9.6.2). It turns out that the autocorrelation 
function decays with increasing lag length. This result suggests that 
an autoregressive pattern is more suitable than the equi-correlation 
implied by an error component model. In any case, the GMM 
estimators remain consistent regardless of the form of the autocor- 
relation function. 


9.9 Concluding Remarks 


GMM is an attractive approach for estimating complex models like 
nonlinear error component models popular in current econometric 
research. Simple estimators can be constructed by using restric- 
tions implied by the conditional mean function. Asymptotically 
optimal GMM estimators based on the conditional mean function 
are obtained by using a parametric or nonparametric approach. 
Our Monte Carlo results clearly indicate that the nonparametric 
approach is superior in small samples. 


In addition we consider further moment conditions derived 
from higher order moments or the “score generator” as proposed 
by Gallant and Tauchen [1996]. From a practical perspective the 
latter approach is appealing, because it provides simple moment 
conditions and may yield highly efficient estimators. However, the 
computational effort is considerable for such estimators. Following 
Gouriéroux et al. [1993], we adopt a minimum distance analog of 
the Gallant-Tauchen estimator, which is much simpler to compute 
and generally renders valid estimates of the standard errors. 
Our Monte Carlo results demonstrate that the GMM proce- 
dures considered in this paper perform well relative to the MLE. 
Although the Gallant-Tauchen estimator is based on a very simple 
“score generator”, the efficiency comes close to the MLE. However, 
the computational burden of the Gallant-Tauchen approach is im- 
mense and there are serious problems when estimating the standard 
errors for the parameters. Therefore, we do not recommend this 
estimator for models like the error component probit models, where 
much simpler GMM procedures with better small sample properties 
are available. However, if the model is more complicated and simple 
GMM estimators do not exist, the Gallant-Tauchen approach may 
be a useful device for providing efficient moment conditions. 


Table 9.1: Measures of Accuracy Used in the Monte Carlo Study 


RMSE:       root MSE                          $\sqrt{\frac{1}{R}\sum_{r=1}^{R} (\hat\theta_r - \theta_0)^2}$
MAE:        median abs. error                 $\mathrm{median}_r\, |\hat\theta_r - \theta_0|$
BIAS(SE):   bias of est. stand. error in %    $\frac{100}{R}\sum_{r=1}^{R} [\hat\sigma(\hat\theta_r) - \sigma(\hat\theta)]/\sigma(\hat\theta)$


Note: $\hat\theta_r$ denotes the $r$th realization of the simulated estimator for $\theta$. 
$\hat\sigma(\hat\theta_r)$ indicates the $r$th estimate of the standard errors of $\hat\theta$ based 
on the asymptotic standard errors. $\sigma(\hat\theta)$ indicates the standard 
errors computed from the Monte Carlo replications. 
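
The three measures in Table 9.1 translate directly into code. The following sketch assumes that the Monte Carlo standard error $\sigma(\hat\theta)$ is the population standard deviation of the simulated estimates; whether the original study divided by $R$ or by $R - 1$ is not stated, and the function names are illustrative.

```python
from math import sqrt
from statistics import median, pstdev


def rmse(estimates, theta0):
    """Root mean squared error of the simulated estimates around theta0."""
    return sqrt(sum((e - theta0) ** 2 for e in estimates) / len(estimates))


def mae(estimates, theta0):
    """Median absolute error of the simulated estimates around theta0."""
    return median(abs(e - theta0) for e in estimates)


def bias_se_percent(se_estimates, estimates):
    """Average relative deviation (in %) of estimated from Monte Carlo standard errors."""
    mc_sd = pstdev(estimates)  # assumption: divide by R rather than R - 1
    return 100.0 * sum((s - mc_sd) / mc_sd for s in se_estimates) / len(se_estimates)
```

A value of `bias_se_percent` near zero indicates that the asymptotic standard errors track the actual sampling variability, while large positive or negative values point to over- or underestimation.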
Table 9.2: Simulation Results for Pure Random Effects 
($\alpha = 0$, $\delta = \sqrt{0.5}$) 

                   RMSE x 10       MAE x 10        bias (SE) in % 
                   β^D    β^N      β^D    β^N      β^D      β^N 

N=100, T=5, 1000 replications 
ML-RE              1.09   1.22     0.69   0.78     -5.4     0.4 
Infeasible GMM     1.09   1.29     0.73   0.84     0.2      -3.9 
Pooled             1.21   1.34     0.75   0.85     -1.4     -0.3 
GMM-IV(param)      1.35   1.62     0.83   0.93     -10.4    -14.3 
GMM-WNP            1.14   1.29     0.76   0.86     -0.7     -0.3 
GT-Score           1.24   1.24     0.84   0.97     82*      27* 
GT-Score+          1.22   1.30     0.73   0.97     -85*     -84* 
GT-MD              1.37   1.37     1.01   0.91     -12.2    -9.1 
GT-MD+             1.21   1.31     0.75   1.00     -3.6     -4.3 

N=400, T=5, 1000 replications 
ML-RE              0.52   0.61     0.35   0.42     0.8      -2.2 
Infeasible GMM     0.55   0.60     0.37   0.39     -0.5     2.4 
Pooled             0.58   0.67     0.39   0.45     2.0      -1.0 
GMM-IV(param)      0.55   0.62     0.38   0.42     -1.9     1.4 
GMM-WNP            0.54   0.63     0.36   0.43     1.9      -0.6 
GT-Score           0.58   0.67     0.36   0.47     215*     108* 
GT-Score+          0.62   0.67     0.41   0.46     -87*     -86* 
GT-MD              0.66   0.75     0.41   0.50     9.2      8.5 
GT-MD+             0.58   0.65     0.37   0.43     -5.7     -6.1 

N=100, T=10, 500 replications 
ML-RE              0.74   0.85     0.52   0.53     1.2      5.7 
Infeasible GMM     0.80   0.89     0.55   0.61     -4.8     1.1 
Pooled             0.86   0.94     0.58   0.61     -0.5     5.0 
GMM-IV(param)      2.72   3.71     1.38   1.54     4458     739 
GMM-WNP            0.84   0.96     0.58   0.60     -2.4     1.2 
GT-Score           0.82   0.99     0.56   0.70     14.9*    -12.0* 
GT-Score+          0.80   1.00     0.53   0.64     -85.0*   -86.1* 
GT-MD              0.93   1.13     0.66   0.71     8.9      -5.4 
GT-MD+             0.78   0.98     0.53   0.62     -0.7     -8.2 




Table 9.2: (continued) 


                   RMSE x 10       MAE x 10        bias (SE) in % 

ML-RE 
Infeasible GMM 
Pooled 
                   0.37   0.45     0.24   0.30     1.4      -0.3 
                   0.41   0.49     0.28   0.34     39       0.2 
GMM-IV(param)      0.58   0.77     0.34   0.40     -19      -27 
GMM-WNP            0.40   0.47     0.26   0.32     -1.4     -1.7 
GT-Score           0.41   0.46     0.27   0.29     159*     129* 
GT-Score+          0.43   0.47     0.27   0.31     -86*     -85* 
GT-MD              0.46   0.48     0.30   0.32     -12      -8.9 
GT-MD+ 


Note: * The estimates of the standard errors are based on the 
assumption of a correctly specified model. 
Table 9.3: Simulation Results for AR(1) and Random Effects 
($\alpha = 0.8$, $\delta = 0.2$) 

                   RMSE x 10       MAE x 10        bias (SE) in % 
                   β^D    β^N      β^D    β^N      β^D      β^N 

N=100, T=5, 1000 replications 
ML-RE              0.61   0.64     0.40   0.44     4.0      2.2 
Infeasible GMM     0.61   0.56     0.41   0.45     -0.2     -1.3 
Pooled             0.70   0.73     0.47   0.51     -1.3     0.9 
GMM-IV(param)      0.63   0.68     0.42   0.47     -0.2     -1.7 
GMM-WNP            0.61   0.68     0.41   0.46     2.2      -0.3 
GT-Score           0.68   0.71     0.47   0.47     472*     388* 
GT-Score+          0.68   0.70     0.45   0.47     -85*     -85* 
GT-MD              0.68   0.71     0.45   0.47     -22      -19 
GT-MD+             0.60   0.64     0.41   0.40     -8.3     -4.6 


Note: * The estimates of the standard errors are based on the 
assumption of a correctly specified model. 


Table 9.4: Descriptive Statistics 


Variable         Description                                   mean     std.dev. 

ln(sales)        ln of ind. sales in DM                        10.540   1.00 
Rel. firm size   ratio of empl. in business unit to 
                 empl. in industry                             0.074    0.290 
Imp. share       ratio of industry imps. to (sales + imps.)    0.250    0.120 
FDI-share        ratio of industry FDI to (sales + imports)    0.046    0.047 
Productivity     ratio of industry V.A. to industry empl.      0.090    0.033 
Raw mat.         = 1 if firm is in ‘raw mat.’                  0.087    0.280 
Invest.          = 1 if firm is in ‘invest. goods’             0.500    0.500 
                 = 1 if prod. in. is realized 
Table 9.5: Estimation Results for the Innovation Probit 

             pooled    GMM-WNP   GMM-IV    GT-MD     GT-MD+ 
                                 (param) 

ln(sales)    0.18      0.15      0.17      0.15      0.17 
             (0.04)    (0.04)    (0.04)    (0.06)    (0.03) 
R. size      1.07      0.95      1.20      0.90      0.95 
             (0.31)    (0.20)    (0.31)    (0.21)    (0.20) 
Imp. sh.     1.13      1.14      1.12      0.99      1.05 
             (0.24)    (0.24)    (0.24)    (0.29)    (0.21) 
FDI sh.      2.85      2.59      2.78      2.87      2.88 
             (0.68)    (0.59)    (0.68)    (0.79)    (0.53) 
Prod.        -2.34     -1.91     -2.54     -2.49     -2.50 
             (1.32)    (0.82)    (1.24)    (1.67)    (0.70) 
R. mat.      -0.28     -0.28     0.24      -0.20     -0.24 
             (0.13)    (0.12)    (0.13)    (0.11)    (0.07) 
Invest.      0.19      0.21      0.19      0.20      0.20 
             (0.06)    (0.06)    (0.06)    (0.08)    (0.05) 
Const.       -1.96     -1.74     -1.86     -1.68     -1.90 
             (0.38)    (0.37)    (0.37)    (0.54)    (0.31) 

Table 9.6: Estimated Correlation Matrix 


        1985    1986    1987    1988 

1984    0.60    0.65    0.58    0.48 
1985            0.64    0.56    0.37 
1986                    0.68    0.58 
1987                            0.64 
Chapter 10 


SIMULATION BASED METHOD OF MOMENTS 


Roman Liesenfeld and Jörg Breitung 


The estimation of unknown parameters generally involves optimizing
a criterion function based on the likelihood function or a set of
moment restrictions. Unfortunately, for many econometric models the
likelihood function and/or the relevant moment restrictions do not
have a tractable analytical form in terms of the unknown parameters,
which renders estimation by maximum likelihood (ML) or the generalized
method of moments (GMM) infeasible. This estimation problem typically
arises when unobservable variables enter the model nonlinearly, leading
to multiple integrals in the criterion function which cannot be
evaluated by standard numerical procedures. Prominent examples of such
models in financial econometrics are continuous-time models of stock
prices or interest rates and discrete-time stochastic volatility models.


Until recently, estimation problems due to the lack of a tractable
criterion function were often circumvented by using approximations of
the model that produce criterion functions simple enough to be
evaluated. However, using such approximations may lead to inconsistent
estimates of the parameters of interest. An alternative solution in
such cases, which has received increased attention over the last few
years, is the use of Monte Carlo simulation methods to compute an
otherwise intractable criterion function.¹

¹ It is worth noting that Monte Carlo simulation methods have already been
used for a long time in Bayesian econometrics to evaluate posterior
distributions, see e.g., Kloek and van Dijk [1978].
Seminal for the development of this type of estimation procedure
were the contributions of McFadden [1989] and Pakes and Pollard 
[1989] who introduced the Method of Simulated Moments (MSM) 
in a cross sectional context. This approach, which was extended 
to time series applications by Lee and Ingram [1991] and Duffie 
and Singleton [1993], modifies the traditional GMM estimator by 
using moments computed from simulated data rather than the 
analytical ones. Like the GMM estimator, the MSM estimator 
is consistent and asymptotically normal when the number of ob- 
servations approaches infinity, and is asymptotically equivalent to 
the “usual” GMM estimator if the number of simulations goes to 
infinity. However, in a fully parametric model the MSM, just as 
the GMM, is inefficient relative to procedures based on the full 
likelihood due to the arbitrary choice of moment restrictions. This 
issue is addressed by the indirect inference estimators proposed by 
Gouriéroux, Monfort and Renault [1993], Bansal, Gallant, Hussey 
and Tauchen [1993], [1995] and Gallant and Tauchen [1996a]. 
These approaches, which represent extensions of the MSM, introduce
an auxiliary model in order to estimate the parameters of the model 
of interest. The first version of the indirect inference as proposed 
by Gouriéroux, Monfort and Renault [1993] uses the parameters of 
the auxiliary model to define the GMM criterion function, whereas 
in the second version as suggested by Bansal, Gallant, Hussey and 
Tauchen [1993], [1995] and Gallant and Tauchen [1996a] the scores 
of the auxiliary model generate the moment restrictions used in 
the GMM criterion function. Since in both procedures the GMM 
criterion is an intractable function in terms of the parameters of 
interest, simulations are used to evaluate it. Both indirect inference 
estimators are consistent and asymptotically normal as the number 
of observations goes to infinity and approach the fully efficient 
estimator if the auxiliary model is appropriately chosen. When 
the auxiliary model is based on the semi-nonparametric model of
Gallant and Nychka [1987], as proposed by Gallant and Tauchen 
[1996a], one may hope that the loss of efficiency of the indirect 
inference estimator is small. 


The purpose of this chapter is to give a selective review of the 
MSM techniques, and to illustrate their applications to financial 
models. 
Besides these moment based simulation approaches, a variety of other
simulation estimators have been proposed in the literature, including
simulated maximum likelihood (Danielsson and Richard [1993] and
Richard and Zhang [1997]), and Markov Chain Monte Carlo procedures
(Jacquier, Polson and Rossi [1994] and Kim, Shephard and Chib [1996]).
Surveys on these likelihood based simulation methods are given by
Ghysels, Harvey and Renault [1996] and Shephard [1996]. An extensive
overview of simulation based estimation methods including MSM and
likelihood based procedures can be found in Gouriéroux and Monfort
[1996].


This chapter is organized as follows. In Section 10.1 we outline 
the estimation context and give some examples. The MSM and 
the indirect inference estimator are discussed in Sections 10.2 and 
10.3, respectively. Section 10.4 reviews the semi-nonparametric
auxiliary model and in Section 10.5 we address selected practical 
issues concerning the application of these estimators. We conclude 
in Section 10.6. 


10.1 General Setup and Applications 


Let $y_t$, $t = 1,\dots,T$, denote an $n$-dimensional vector of
observable dependent variables and $x_t$ a $k$-dimensional vector of
observable strongly exogenous variables. For expositional convenience
it is assumed that $y_t$ and $x_t$ are stationary. The nonlinear
dynamic model is characterized by the conditional density
$h_0(y_t|z_t)$, where
$z_t = [y_{t-1}',\dots,y_1',y_0',x_t',\dots,x_1']'$ is the vector of
conditioning variables and the initial conditions are represented by
$y_0$. We want to estimate the $p$-dimensional parameter vector
$\theta$ from the model $M := \{h(y_t|z_t;\theta), \theta \in \Theta\}$,
where $\Theta$ denotes the parameter space. The true value $\theta_0$
is the unique value of $\theta$ such that
$h_0(y_t|z_t) = h(y_t|z_t;\theta_0)$. In the following, we use
$h(\cdot)$ as a generic notation for all density functions.


The estimation of $\theta_0$ is generally based on the likelihood
function $L_T(\theta) = \prod_{t=1}^{T} h(y_t|z_t;\theta)$ or on moment
restrictions based on a set of moments such as $E[y_t|z_t]$ or
$E[y_t y_t'|z_t]$. Here we are interested in the cases where the
likelihood function or the relevant moments have an intractable form,
making the ML or GMM procedures infeasible. Nevertheless, we assume
that the model allows us to simulate values of the process $\{y_t\}$
for a given value of the parameter vector $\theta$ and the initial
conditions $y_0$.


For dynamic models with lagged endogenous variables two different
simulation schemes exist (see Gouriéroux and Monfort [1996, p. 17]).
If the model admits a reduced form
$y_t = \varrho(z_t,\varepsilon_t;\theta)$, where $\varepsilon_t$ is an
error term stochastically independent of $z_t$ and with a known
distribution independent of $\theta$, simulated random variables
$y_t^{(r)}(\theta)$, $(r = 1,\dots,R)$ from the distribution
$h(y_t|z_t;\theta)$ can be generated as follows. Artificial random
variables $\varepsilon_t^{(r)}$ from the distribution of
$\varepsilon_t$ are generated and used to calculate

$$y_t^{(r)}(\theta) = \varrho(z_t,\varepsilon_t^{(r)};\theta)$$

for the observed values of
$z_t = [y_{t-1}',\dots,y_1',y_0',x_t',\dots,x_1']'$ and a value of the
parameter vector $\theta$. For a large number of replications $R$, the
empirical distribution of the simulated values $y_t^{(r)}(\theta)$,
$(r = 1,\dots,R)$ approximates the conditional distribution
$h(y_t|z_t;\theta)$ for every $t$. Since the simulations are performed
conditionally on the observed lagged endogenous variables, this
simulation scheme is called conditional simulations. The second
approach, termed path simulations, is to generate simulated values of
$y_t$ conditionally on simulated lagged endogenous variables, i.e.,
conditionally on
$z_t^{(r)}(\theta) = [y_{t-1}^{(r)}(\theta)',\dots,y_1^{(r)}(\theta)',y_0',x_t',\dots,x_1']'$,
using some kind of recursion. For large $R$, the empirical joint
distribution of $y_1^{(r)}(\theta),\dots,y_T^{(r)}(\theta)$,
$(r = 1,\dots,R)$ approximates the distribution
$h(y_1,\dots,y_T|x_1,\dots,x_T;\theta)$.
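
The difference between the two schemes can be sketched in a few lines.
The sketch below assumes, purely for illustration, a Gaussian AR(1)
reduced form $y_t = \theta y_{t-1} + \varepsilon_t$ without exogenous
variables; the function names and all numerical settings are
hypothetical and not part of the chapter.

```python
import numpy as np


def reduced_form(y_lag, eps, theta):
    """Illustrative reduced form y_t = rho(z_t, eps_t; theta): a Gaussian AR(1)."""
    return theta * y_lag + eps


def conditional_simulations(y_obs, theta, R, rng):
    """Draw y_t^(r)(theta) given the OBSERVED lag y_{t-1}; returns an (R, T) array."""
    T = len(y_obs) - 1                      # y_obs[0] plays the role of y_0
    eps = rng.standard_normal((R, T))       # artificial errors eps_t^(r)
    return reduced_form(y_obs[:-1][None, :], eps, theta)


def path_simulations(y0, theta, T, R, rng):
    """Draw whole paths recursively, conditioning on SIMULATED lags; returns (R, T)."""
    eps = rng.standard_normal((R, T))
    paths = np.empty((R, T))
    y_lag = np.full(R, float(y0))
    for t in range(T):
        y_lag = reduced_form(y_lag, eps[:, t], theta)
        paths[:, t] = y_lag
    return paths


rng = np.random.default_rng(0)
theta, T, R = 0.5, 200, 500
# "Observed" data for the illustration: y_0 = 0 followed by one simulated path.
y_obs = np.concatenate(([0.0], path_simulations(0.0, theta, T, 1, rng)[0]))
cond = conditional_simulations(y_obs, theta, R, rng)
path = path_simulations(y_obs[0], theta, T, R, rng)
```

In this toy setting the average of `cond` over replications tracks
$\theta y_{t-1}$ for the observed lag, whereas `path` reproduces the
joint (here stationary) behavior of the whole series.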

We next illustrate these procedures using some financial
applications.²

² As in most financial econometrics applications, a time series framework is
used here. Examples for cross sectional applications are given by Gouriéroux
and Monfort [1993], [1996] and Stern [1997].


EXAMPLE 10.1 Discrete-time Stochastic Volatility Model


The standard discrete-time stochastic volatility (SV) model proposed
by Taylor [1986], [1994] and others is given by

$$y_t = \exp\{w_t^*/2\}\,u_t \tag{10-1}$$

$$w_t^* = \gamma + \delta w_{t-1}^* + \nu\eta_t , \qquad |\delta| < 1 , \tag{10-2}$$

where $y_t$ is the observable return of a financial asset and $w_t^*$
is the unobservable log volatility. The error processes $u_t$ and
$\eta_t$ are mutually and serially independent with known
distributions. In accounting for the observed autocorrelation in the
variance of financial time series, this SV model represents an
alternative to the ARCH and GARCH specifications proposed by Engle
[1982] and Bollerslev [1986]. Since the latent log volatility $w_t^*$
enters the model in a nonlinear form, the conditional density
$h(y_t|z_t;\theta)$ with $\theta = [\gamma,\delta,\nu]'$ and
$z_t = [y_{t-1},\dots,y_1,y_0]'$ does not have an explicit analytical
form. To obtain the (marginal) likelihood function associated with the
observable variables, the latent variables are "integrated out" from
the joint distribution of $y_1,\dots,y_T,w_1^*,\dots,w_T^*$ denoted by
$h(y_1,\dots,y_T,w_1^*,\dots,w_T^*|\theta)$. This distribution can be
factorized as

$$h(y_1,\dots,y_T,w_1^*,\dots,w_T^*|\theta) = \prod_{t=1}^{T} h(y_t|w_t^*;\theta)\,h(w_t^*|w_{t-1}^*;\theta) ,$$

where $h(y_t|w_t^*;\theta)$ is the conditional density of the returns
given the log volatility and $h(w_t^*|w_{t-1}^*;\theta)$ denotes the
conditional density of the log volatility given its past value. Hence,
for a given initial value of the log volatility $w_0^*$ the marginal
likelihood has the following form

$$L_T(\theta) = \int\cdots\int \prod_{t=1}^{T} h(y_t|w_t^*;\theta)\,h(w_t^*|w_{t-1}^*;\theta)\,dw_1^*\cdots dw_T^* .$$

For this $T$-dimensional integral no closed form solution exists, nor
can standard numerical methods be applied to evaluate it, making ML
estimation infeasible. Furthermore, even though the standard SV model
can be estimated by GMM using unconditional moments such as
$E[|y_t|]$, $E[y_t^2]$ or $E[y_t^2 y_{t-1}^2]$, GMM is relatively
inefficient, especially if the persistence parameter $\delta$ is close
to one (see, e.g., Jacquier, Polson and Rossi [1994] and Andersen and
Sørensen [1996]). However, the SV model given by (10-1) and (10-2)
defines a simple data generating process which allows us to generate
values from the joint distribution $h(y_1,\dots,y_T|\theta)$ implied by
the model using path simulations. Note though, that conditional
simulations from $h(y_t|y_{t-1},\dots,y_0;\theta)$ appear to be
infeasible since the SV model does not admit an explicit expression of
the reduced form in terms of lagged endogenous variables
$y_t = \varrho(y_{t-1},\dots,y_0,\varepsilon_t;\theta)$.
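
A path simulation from the SV model (10-1) and (10-2) might look as
follows. This is a minimal sketch under assumptions that the text does
not impose: both error processes are taken to be standard normal, the
recursion is started at the stationary mean of the log volatility, and
a burn-in period (the name `burn_in` is hypothetical) is discarded.

```python
import numpy as np


def simulate_sv(theta, T, rng, burn_in=200):
    """Simulate returns y_1..y_T from y_t = exp(w_t/2) u_t, w_t = gamma + delta w_{t-1} + nu eta_t.

    Assumes u_t and eta_t are i.i.d. N(0,1) and |delta| < 1 (illustrative choices).
    """
    gamma, delta, nu = theta
    w = gamma / (1.0 - delta)               # start at the stationary mean of w_t
    eta = rng.standard_normal(T + burn_in)
    u = rng.standard_normal(T + burn_in)
    y = np.empty(T + burn_in)
    for t in range(T + burn_in):
        w = gamma + delta * w + nu * eta[t]  # latent log volatility (10-2)
        y[t] = np.exp(w / 2.0) * u[t]        # observable return (10-1)
    return y[burn_in:]


rng = np.random.default_rng(0)
theta = (0.0, 0.9, 0.3)                      # (gamma, delta, nu), arbitrary values
y = simulate_sv(theta, T=20000, rng=rng)
second_moment = np.mean(y ** 2)
# Under the Gaussian assumption and gamma = 0, E[y_t^2] = exp(0.5 * nu^2 / (1 - delta^2)).
implied = np.exp(0.5 * theta[2] ** 2 / (1.0 - theta[1] ** 2))
```

Only the observable series `y` is returned, which mirrors the
estimation problem: the simulated log volatility is available to the
simulator but never to the econometrician.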
EXAMPLE 10.2 Stochastic Differential Equations


Consider the following scalar stochastic differential equation:

$$dv_t = a(v_t,\theta)\,dt + b(v_t,\theta)\,dW_t , \qquad 0 \le t \le N , \tag{10-3}$$

where $a(v_t,\theta)$ and $b(v_t,\theta)$ are the drift and the
diffusion function, respectively, and $W_t$ is a Brownian motion. Such
continuous-time processes are often used to model stock prices and
interest rates. In practice, however, the variables are observable
only at some discrete (possibly equispaced) points. Hence, the
observable variables $y_t$ $(t = 1,\dots,T)$ are given by
$y_t = v_{t\Delta}$ for some $\Delta > 0$, where the time interval
between two observations is $[t, t+\Delta)$. For arbitrary drift and
diffusion functions, the distribution of the observable variables
generally does not have a closed form expression. A closed form can be
obtained only for some special drift and diffusion functions. As an
example, consider the square root process proposed by Cox, Ingersoll
and Ross [1985] to model the evolution of interest rates

$$dv_t = (\alpha_0 + \alpha_1 v_t)\,dt + \beta_0\sqrt{v_t}\,dW_t .$$

This stochastic differential equation implies a joint distribution of
the observable variables $y_1,\dots,y_T$ given by

$$\prod_{t=1}^{T} h(y_t|y_{t-1};\theta) ,$$

where $h(y_t|y_{t-1};\theta)$ corresponds to a non-central $\chi^2$
distribution. However, for more complicated specifications the
conditional density $h(y_t|y_{t-1};\theta)$ and, in general, its
moments do not have a tractable form since $h(y_t|y_{t-1};\theta)$
appears as a multiple integral (see, e.g., Gouriéroux and Monfort
[1996, p. 10f]). This motivates the use of alternative procedures to
the standard ML and GMM estimators. An example for a specification
with an intractable density $h(y_t|y_{t-1};\theta)$ is the following
generalisation of the Cox-Ingersoll-Ross model:

$$dv_t = (\alpha_0 + \alpha_1 v_t)\,dt + \beta_0 v_t^{\beta_1}\,dW_t ,$$

which is proposed by Chan, Karolyi, Longstaff and Sanders [1992]. To
simulate values of the observable discrete-time variables according to
a continuous-time model, one can use a discrete-time approximation,
for example, the Euler approximation. If the time interval between two
observations $[t, t+\Delta)$ is divided into subintervals of length
$\tau$, the corresponding Euler approximation of (10-3) becomes

$$v_{t+k\tau} = v_{t+(k-1)\tau} + \tau\,a(v_{t+(k-1)\tau},\theta) + \sqrt{\tau}\,b(v_{t+(k-1)\tau},\theta)\,\eta_k , \qquad k = 1,\dots,\Delta/\tau ,$$

where $\eta_k$ is an i.i.d. $N(0,1)$ random variable. If the time
interval $\tau$ is sufficiently small, this approximation can be used
to simulate values from $h(y_1,\dots,y_T|\theta)$ according to
$y_t = \varrho(y_{t-1},\varepsilon_t;\theta)$, where
$\varepsilon_t = [\eta_1,\dots,\eta_{\Delta/\tau}]'$ is the vector of
error terms.
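
The Euler recursion of Example 10.2 can be sketched as follows for the
generalized square root specification. The function name, the
parameter ordering and the numerical values are hypothetical; the
truncation of the state at zero inside the diffusion term is a
practical guard added here and is not part of the scheme described in
the text.

```python
import numpy as np


def euler_path(theta, y0, T, delta, tau, rng):
    """Simulate y_t = v_{t*delta}, t = 1..T, from
    dv = (alpha0 + alpha1 * v) dt + beta0 * v**beta1 dW
    using Euler steps of length tau within each observation interval."""
    alpha0, alpha1, beta0, beta1 = theta
    n_sub = int(round(delta / tau))           # number of subintervals per observation
    v = float(y0)
    y = np.empty(T)
    for t in range(T):
        eta = rng.standard_normal(n_sub)      # error vector eps_t = [eta_1, ..., eta_{delta/tau}]
        for k in range(n_sub):
            drift = alpha0 + alpha1 * v
            diffusion = beta0 * max(v, 0.0) ** beta1   # guard against negative states (assumption)
            v = v + tau * drift + np.sqrt(tau) * diffusion * eta[k]
        y[t] = v
    return y


rng = np.random.default_rng(0)
# Deterministic check (beta0 = 0): the path should settle at -alpha0/alpha1 = 1.
y_det = euler_path((0.5, -0.5, 0.0, 0.5), y0=2.0, T=50, delta=1.0, tau=0.01, rng=rng)
# Square root case (beta1 = 0.5) with mean-reverting drift.
y_cir = euler_path((0.5, -0.5, 0.2, 0.5), y0=1.0, T=500, delta=1.0, tau=0.02, rng=rng)
```

Choosing a smaller `tau` reduces the discretization bias at the cost
of more random draws per observation.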


The common feature of Examples 10.1 and 10.2 is that (partially)
unobservable processes enter the model nonlinearly, making criterion
functions commonly used for estimation intractable. Further examples
of this in financial econometrics are the continuous-time stochastic
volatility models of Hull and White [1987] and Chesney and Scott
[1989], the market microstructure model proposed by Foster and
Viswanathan [1995], the dynamic equilibrium model for asset prices
estimated by Bansal, Gallant, Hussey and Tauchen [1995] and the
multifactor latent ARCH models of Diebold and Nerlove [1989] and
Engle, Ng and Rothschild [1990].


10.2 The Method of Simulated Moments (MSM)


Consider a dynamic model with a well defined reduced form
$y_t = \varrho(z_t,\varepsilon_t;\theta)$ which can be used to simulate
values of $y_t$ from $h(y_t|z_t;\theta)$ for observed values of the
conditioning variables

$$z_t = [y_{t-1}',\dots,y_1',y_0',x_t',x_{t-1}',\dots,x_1']' .$$

We will focus on the $m$-dimensional moment function of the form

$$e(y_t,z_t;\theta) = s(y_t,z_t) - \sigma(z_t;\theta) , \tag{10-4}$$

with $m \ge p$ and where $s(y_t,z_t)$ is a function of the data and
$\sigma(z_t;\theta)$ is the theoretical counterpart defined as

$$\sigma(z_t;\theta) = E_\theta[s(y_t,z_t)|z_t] .$$

Here $E_\theta(\cdot|z_t)$ indicates that the expectation is computed
with respect to the density $h(y_t|z_t;\theta)$ and
$\sigma(z_t;\theta)$ represents conditional moments as, for example,
$E_\theta(y_t|z_t)$ or $E_\theta(y_t y_t'|z_t)$. The index is dropped
if the expectation is taken with respect to the true process, i.e.,
$E_{\theta_0} = E$. We assume that for $\theta_0$ the empirical moment
condition

$$E[e(y_t,z_t;\theta_0)|z_t] = 0 \quad \text{for all } t$$

is satisfied. Let $f(y_t,z_t;\theta_0) = B(z_t)'e(y_t,z_t;\theta_0)$,
where $B(z_t)$ is some nonlinear matrix function of $z_t$, then the
corresponding set of unconditional moment restrictions is given by
(see, e.g., Newey [1993])

$$E[f(y_t,z_t;\theta_0)] = 0 \quad \text{for all } t .$$


If the expression $\sigma(z_t;\theta)$ cannot be computed
analytically, it may be approximated using simulation methods. Since
$\sigma(z_t;\theta)$ is the expected value of $s(y_t,z_t)$ evaluated
with respect to $h(y_t|z_t;\theta)$, a natural unbiased estimator for
$\sigma(z_t;\theta)$ is given by

$$\hat\sigma_R(z_t;\theta) = \frac{1}{R}\sum_{r=1}^{R} s[y_t^{(r)}(\theta),z_t] , \tag{10-5}$$

where $y_t^{(r)}(\theta)$, $(r = 1,\dots,R)$ are simulated random
variables drawn from the distribution $h(y_t|z_t;\theta)$ for the
observed values of $z_t$. The natural estimator of
$\sigma(z_t;\theta)$ given in equation (10-5) results from sampling
data using $h(y_t|z_t;\theta)$. However, this estimator may have
undesirable properties. For example, it may not be differentiable with
respect to $\theta$ or it may have a large variance. Therefore,
alternative methods of estimating $\sigma(z_t;\theta)$ such as
importance sampling procedures were proposed to obtain an estimator
with improved properties (see Gouriéroux and Monfort [1993] and Stern
[1997]).
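
As an illustration of the natural estimator (10-5), the sketch below
uses a hypothetical reduced form
$y_t = \exp(\theta z_t + 0.5\,\varepsilon_t)$ with standard normal
$\varepsilon_t$ and $s(y_t,z_t) = y_t$. This model is chosen only
because its conditional mean,
$\sigma(z_t;\theta) = \exp(\theta z_t + 0.125)$, is known and can serve
as a check; it does not appear in the chapter.

```python
import numpy as np


def reduced_form(z, eps, theta):
    """Hypothetical reduced form y_t = exp(theta * z_t + 0.5 * eps_t)."""
    return np.exp(theta * z + 0.5 * eps)


def sigma_hat(z, theta, eps_draws):
    """Natural Monte Carlo estimator (10-5) with s(y, z) = y.

    eps_draws has shape (R, T): R artificial error draws for each observation t.
    """
    return reduced_form(z[None, :], eps_draws, theta).mean(axis=0)


rng = np.random.default_rng(1)
z = np.linspace(-1.0, 1.0, 5)                 # observed conditioning values (illustrative)
theta = 0.4
eps_draws = rng.standard_normal((20000, z.size))
approx = sigma_hat(z, theta, eps_draws)
exact = np.exp(theta * z + 0.125)             # analytical conditional mean of this toy model
```

Holding `eps_draws` fixed while $\theta$ varies makes `sigma_hat` a
smooth function of $\theta$ in this example, which is one simple way
of addressing the differentiability concern mentioned above.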
If the natural Monte Carlo estimator (10-5) is used to estimate the
moment restrictions, the method of simulated moments (MSM) estimator
for $\theta_0$ is obtained by minimizing the criterion function

$$\hat\theta^{R}_{MSM} = \arg\min_{\theta} \left[\sum_{t=1}^{T} \hat f_R(y_t,z_t;\theta)\right]' A \left[\sum_{t=1}^{T} \hat f_R(y_t,z_t;\theta)\right] , \tag{10-6}$$

where

$$\hat f_R(y_t,z_t;\theta) = B(z_t)'[s(y_t,z_t) - \hat\sigma_R(z_t;\theta)]$$

and $A$ denotes an appropriately chosen positive definite weighting
matrix. If the simulation sample size $R$ tends to infinity,
$\hat\sigma_R(z_t;\theta)$ converges almost surely to
$E_\theta[s(y_t,z_t)|z_t]$ and the MSM estimator equals the
corresponding GMM estimator. However, as the sample size $T$ tends to
infinity, the MSM estimator is consistent for any fixed $R \ge 1$ as
long as different random draws are used across $t$ (cf. McFadden
[1989]). The reason for this is that for the estimator
$\hat\theta^{R}_{MSM}$ the simulation error is "averaged out" by using
the mean of $\hat\sigma_R(z_t;\theta)$, $(t = 1,\dots,T)$.
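
A toy version of the estimator (10-6) is sketched below. It assumes a
hypothetical reduced form
$y_t = \exp(\theta z_t + 0.5\,\varepsilon_t)$ with $s(y_t,z_t) = y_t$,
instruments $B(z_t) = [1, z_t]$, an identity weighting matrix and a
grid search in place of a numerical optimizer; none of these choices
come from the chapter. Only $R = 5$ draws per observation are used.

```python
import numpy as np


def reduced_form(z, eps, theta):
    """Hypothetical reduced form y_t = exp(theta * z_t + 0.5 * eps_t)."""
    return np.exp(theta * z + 0.5 * eps)


def msm_criterion(theta, y, z, eps_draws, A):
    """Criterion (10-6) with s(y, z) = y and instruments B(z_t) = [1, z_t]."""
    sigma_R = reduced_form(z[None, :], eps_draws, theta).mean(axis=0)   # estimator (10-5)
    B = np.column_stack([np.ones_like(z), z])                            # T x m instrument matrix
    g = B.T @ (y - sigma_R)                                              # sum over t of f_R
    return g @ A @ g


rng = np.random.default_rng(2)
T, R, theta0 = 2000, 5, 0.4
z = rng.uniform(-1.0, 1.0, T)
y = reduced_form(z, rng.standard_normal(T), theta0)    # artificial "observed" data
eps_draws = rng.standard_normal((R, T))                # independent across t, fixed across theta
grid = np.linspace(0.0, 1.0, 201)
values = [msm_criterion(th, y, z, eps_draws, np.eye(2)) for th in grid]
theta_msm = grid[int(np.argmin(values))]
```

The draws are independent across $t$ but reused for every candidate
$\theta$, so the criterion is a fixed function of $\theta$ and the
simulation error averages out over the $T$ observations.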


The fact that the MSM estimator is consistent for any $R \ge 1$
should not be taken as an indication that $R$ is irrelevant for the
asymptotic properties of $\hat\theta^{R}_{MSM}$ as $T \to \infty$. This
becomes clear from considering the asymptotic distribution of the MSM
estimator, which results as
$T^{1/2}(\hat\theta^{R}_{MSM} - \theta_0) \xrightarrow{d} N(0, \mathrm{avar}(\hat\theta^{R}_{MSM}))$.
The asymptotic covariance matrix of $\hat\theta^{R}_{MSM}$, as it
results from the fact that $\{\hat f_R(y_t,z_t;\theta_0)\}$ is by
construction serially uncorrelated with identical distributions, has
the form (see Gouriéroux and Monfort [1996, p. 29])

$$\mathrm{avar}(\hat\theta^{R}_{MSM}) = \Sigma_1^{-1}\left[\Sigma_0 + R^{-1} D'A\,\mathrm{var}[f(y_t^{(r)}(\theta_0),z_t;\theta_0)]\,AD\right]\Sigma_1^{-1} , \tag{10-7}$$

where

$$D = E\left[B(z_t)'\frac{\partial\sigma(z_t;\theta_0)}{\partial\theta'}\right] ,$$

$$\Sigma_1 = D'AD ,$$

$$\Sigma_0 = D'A\,\mathrm{var}[f(y_t,z_t;\theta_0)]\,AD .$$

The lower bound of the asymptotic covariance matrix obtained for
$R \to \infty$ is given by the asymptotic covariance of the
corresponding GMM estimator $\Sigma_1^{-1}\Sigma_0\Sigma_1^{-1}$.
However, the asymptotic covariance matrix of the MSM estimator
contains, compared to that of the GMM estimator, an additional
component which is due to the variation in the Monte Carlo estimates
of the moment restrictions. This additional Monte Carlo sampling
variance vanishes as the simulation sample size increases and the MSM
estimator attains the efficiency of the corresponding GMM estimator.


The asymptotic optimal weighting matrix which minimizes the
asymptotic covariance of $\hat\theta^{R}_{MSM}$ for a given set of
moment restrictions is:

$$A_0 = \left(\mathrm{var}[f(y_t,z_t;\theta_0)] + R^{-1}\mathrm{var}[f(y_t^{(r)}(\theta_0),z_t;\theta_0)]\right)^{-1} .$$

For this optimal choice of the weight matrix the asymptotic covariance
matrix of the MSM estimator is
$\mathrm{avar}(\hat\theta^{R}_{MSM}) = [D'A_0D]^{-1}$.
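
As a short check of this claim, write
$V = \mathrm{var}[f(y_t,z_t;\theta_0)]$ and
$V_s = \mathrm{var}[f(y_t^{(r)}(\theta_0),z_t;\theta_0)]$ (shorthand
introduced here only for illustration) and read the asymptotic
covariance matrix in its sandwich form with middle term
$V + R^{-1}V_s$. Substituting $A_0 = (V + R^{-1}V_s)^{-1}$ then
collapses the expression:

$$\begin{aligned}
\mathrm{avar}(\hat\theta^{R}_{MSM})
 &= (D'AD)^{-1}\,D'A\,(V + R^{-1}V_s)\,AD\,(D'AD)^{-1} \\
 &= (D'A_0D)^{-1}\,D'A_0\,A_0^{-1}\,A_0D\,(D'A_0D)^{-1} \qquad \text{for } A = A_0 \\
 &= (D'A_0D)^{-1} .
\end{aligned}$$

If, in addition, the simulated draws at $\theta_0$ have the same
conditional distribution as the data, so that $V_s = V$, this reduces
to $(1 + 1/R)\,[D'V^{-1}D]^{-1}$, which makes the cost of a small $R$
explicit.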


The MSM estimator given above is based on conditional moments of the
function $s(y_t,z_t)$ given

$$z_t = [y_{t-1}',\dots,y_1',y_0',x_t',\dots,x_1']' .$$

A necessary requirement for using such conditional moments for MSM
estimation is that the model admits a well defined reduced form
$y_t = \varrho(z_t,\varepsilon_t;\theta)$ in terms of exogenous and
lagged endogenous variables in order to perform conditional
simulations from $h(y_t|z_t;\theta)$. These conditional simulations
are necessary to obtain unbiased estimates for $\sigma(z_t;\theta)$
based on estimators such as the one given in equation (10-5). However,
for models which include unobservable variables nonlinearly as, for
instance, the SV model in Example 10.1, a reduced form in terms of
lagged endogenous variables is generally not available. Hence, in such
cases the MSM estimation based on conditional moments given lagged
endogenous variables is infeasible. Then one may use restrictions
based on moments conditional only on the exogenous variables or, for
pure time series models, restrictions derived from unconditional
moments. Such an MSM approach for pure time series applications has
been proposed by Duffie and Singleton [1993], and has been applied by
Foster and Viswanathan [1995] and Gennotte and Marsh [1993] for
estimating a market microstructure model and a dynamic asset pricing
model, respectively.
This unconditional version of the MSM estimator is based on an
$m$-dimensional moment function of the form

$$f(y_t,\dots,y_{t-l};\theta) = s(y_t,\dots,y_{t-l}) - \sigma(\theta) , \qquad t = 1,\dots,T , \tag{10-8}$$

where $\sigma(\theta)$ represents the unconditional expectation
$E_\theta[s(y_t,\dots,y_{t-l})]$. The corresponding set of moment
restrictions is given by

$$E[f(y_t,\dots,y_{t-l};\theta_0)] = 0 .$$

These restrictions include moments such as

$$E_\theta(y_t) \quad \text{and} \quad E_\theta(y_t y_t')$$

as well as cross order moments of the form $E_\theta(y_t y_{t-j}')$.
If $y_t(\theta)$, $(t = 1,\dots,TR)$ denotes a simulated path from the
distribution $h(y_1,\dots,y_{TR}|\theta)$ implied by the model, the
MSM estimator based on these unconditional moments is obtained by

$$\hat\theta^{R}_{MSM} = \arg\min_{\theta} \left[\frac{1}{T}\sum_{t=1}^{T} s(y_t,\dots,y_{t-l}) - \hat\sigma_R(\theta)\right]' A \left[\frac{1}{T}\sum_{t=1}^{T} s(y_t,\dots,y_{t-l}) - \hat\sigma_R(\theta)\right] ,$$

where

$$\hat\sigma_R(\theta) = \frac{1}{TR}\sum_{t=1}^{TR} s[y_t(\theta),\dots,y_{t-l}(\theta)] .$$

The matrix $A$ denotes the weight matrix and $\hat\sigma_R(\theta)$ is
an unbiased Monte Carlo estimator for $\sigma(\theta)$. As the moment
function (10-8) derived from the dynamic model $h(y_t|z_t;\theta)$ is
expected to be serially correlated, the asymptotic optimal weight
matrix is given by

$$A_0 = \left(\lim_{T\to\infty} \mathrm{var}\left\{\frac{1}{\sqrt{T}}\sum_{t=1}^{T} s(y_t,\dots,y_{t-l})\right\}\right)^{-1} .$$

Since $s(y_t,\dots,y_{t-l})$ is independent of the parameter $\theta$
and independent of the simulated values $y_t(\theta)$, the matrix
$A_0$ can be estimated by procedures discussed in Chapter 3. Like the
MSM estimator based on conditional moments, the MSM estimator using
unconditional moments is consistent and asymptotically normally
distributed as $T$ goes to infinity. The asymptotic distribution for
the optimal weight matrix $A_0$ results as
$T^{1/2}(\hat\theta^{R}_{MSM} - \theta_0) \xrightarrow{d} N(0, \mathrm{avar}(\hat\theta^{R}_{MSM}))$,
with $\mathrm{avar}(\hat\theta^{R}_{MSM}) = [1 + (1/R)][D'A_0D]^{-1}$
and $D = E[\partial\sigma(\theta_0)/\partial\theta']$ (see Duffie and
Singleton [1993]). The factor $[1 + (1/R)]$ in the asymptotic variance
accounts for the additional variation of the estimator due to the
Monte Carlo sampling variance, which vanishes as $R$ goes to infinity.
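
The unconditional version can be sketched with path simulations as
follows. The sketch assumes a deliberately simple structural model, a
Gaussian AR(1), for which the moments are in fact tractable; it is
used here only so that the result can be checked. The moment vector
$s(y_t,y_{t-1}) = [y_t^2,\ y_t y_{t-1}]'$, the identity weight matrix,
the simulated path of length $T \cdot R$ and the grid search are all
choices made for this illustration.

```python
import numpy as np


def simulate_path(theta, shocks):
    """Path simulation of the illustrative model y_t = theta * y_{t-1} + eps_t, y_0 = 0."""
    y = np.empty(len(shocks))
    prev = 0.0
    for t, e in enumerate(shocks):
        prev = theta * prev + e
        y[t] = prev
    return y


def moments(y):
    """Sample mean of s(y_t, y_{t-1}) = [y_t^2, y_t * y_{t-1}]."""
    return np.array([np.mean(y[1:] ** 2), np.mean(y[1:] * y[:-1])])


def msm_objective(theta, s_data, sim_shocks, A):
    """Quadratic distance between data moments and moments of one long simulated path."""
    diff = s_data - moments(simulate_path(theta, sim_shocks))
    return diff @ A @ diff


rng = np.random.default_rng(3)
T, R, theta0 = 1000, 5, 0.5
y_obs = simulate_path(theta0, rng.standard_normal(T))   # artificial "observed" series
sim_shocks = rng.standard_normal(T * R)                 # drawn once, reused for every theta
s_data = moments(y_obs)
grid = np.linspace(-0.9, 0.9, 91)
values = [msm_objective(th, s_data, sim_shocks, np.eye(2)) for th in grid]
theta_msm = grid[int(np.argmin(values))]
```

With the identity matrix in place of an estimate of $A_0$ the
estimator remains consistent in this example but is not expected to be
efficient.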


10.3 Indirect Inference Estimator 


The MSM approach is used to optimize a GMM criterion 
function, which is too complicated to be computed analytically. 
Another approach proposed by Gouriéroux, Monfort and Renault 
[1993], is to use a criterion function derived from an auxiliary, possibly
misspecified model and to recover the structural parameters of
the original model from the parameter estimates of the misspecified 
model. Unfortunately, the relationship between the auxiliary and 
the structural model is too complicated to admit an explicit solu- 
tion. Therefore, simulation techniques are used to derive the final 
estimates. Another view of the indirect inference estimator,
followed by Gallant and Tauchen [1996a], is that the derivatives
of the criterion function for the auxiliary model (usually the
log-likelihood function) can be used as a moment function for a GMM
procedure. Thus, the scores of the Quasi-ML procedure of the pos- 
sibly misspecified auxiliary model are the moments to be matched 
by a GMM approach. Hence, in this context the auxiliary model is 
also termed the score generator. However, if the indirect inference 
estimator is combined with some flexible data dependent choice of 
the auxiliary model, the resulting estimator can be expected to be 
more efficient than a GMM procedure based on an ad hoc selection 
of the moments. For this reason, an indirect inference estimator 
based on such a flexible auxiliary model is called Efficient Method 
of Moments (EMM). 


Consider a dynamic model characterized by $h(y_t|z_t;\theta)$ which
allows us to simulate values of $y_t$ using path simulations but for
which the criterion functions commonly used for estimation are
intractable. Furthermore, let
$M^* = \{h^*(y_t|z_t;\lambda), \lambda \in \Lambda\}$ denote the
auxiliary model with the $q$-dimensional vector of auxiliary
parameters $\lambda$, where $q \ge p$, that is, the auxiliary model
has at least as many parameters as the structural model. The model is
misspecified if there exists no parameter vector $\lambda^*$ such that
$h_0(y_t|z_t) = h^*(y_t|z_t;\lambda^*)$. However, it is assumed that
the auxiliary model has some tractable criterion function (here the
log-likelihood) allowing us to estimate $\lambda$. For example, if we
are interested in estimating the SV model in Example 10.1, a possible
auxiliary model may be a GARCH model which is relatively easy to
estimate by ML compared to the SV model.


The Quasi-ML estimates of $\lambda$ are computed by maximizing the
criterion function
$Q(Y,X;\lambda) = T^{-1}\sum_{t=1}^{T} \log h^*(y_t|z_t;\lambda)$ with
$Y = [y_1,\dots,y_T]$ and $X = [x_1,\dots,x_T]$, that is

$$\hat\lambda_T = \arg\max_{\lambda} Q(Y,X;\lambda) .$$

The first order condition is that the score vector

$$g(Y,X;\lambda) = \frac{\partial Q(Y,X;\lambda)}{\partial\lambda} = \frac{1}{T}\sum_{t=1}^{T} \frac{\partial \log h^*(y_t|z_t;\lambda)}{\partial\lambda} \tag{10-9}$$

equals zero. An important concept linking the structural parameters
$\theta$ with the auxiliary parameters $\lambda$ is the so-called
binding function $\lambda = b(\theta)$ (see Gouriéroux and Monfort
[1996, p. 67]). The binding function is obtained from the solution of
the equation $E_\theta\,g[Y,X;b(\theta)] = 0$, where the expected value
is evaluated with respect to the joint distribution $h(Y,X|\theta)$
implied by the structural model.


From White [1994] it is known that the estimates $\hat\lambda_T$
converge in probability to the pseudo-true value given by
$\lambda_0 = b(\theta_0)$. Hence, if $\lambda$ and $\theta$ are of the
same dimension and if it is assumed that there exists an inverse
function $b^{-1}(\cdot)$, it is possible to obtain an indirect
inference estimator for $\theta_0$ as
$\hat\theta_T = b^{-1}(\hat\lambda_T)$. The practical problem is,
however, that usually the function $b(\theta)$ is unknown and must be
evaluated using Monte Carlo simulations. Therefore, we generate $R$
simulated paths $y_1^{(r)}(\theta),\dots,y_T^{(r)}(\theta)$,
$(r = 1,\dots,R)$ from the distribution
$h(y_1,\dots,y_T|x_1,\dots,x_T;\theta)$ for observed values of the
exogenous variables. For each simulated path we obtain an estimate of
the vector of auxiliary parameters denoted by
$\hat\lambda_T^{(r)}(\theta)$. Then the unknown binding function
$b(\theta)$ can be approximated by

$$\hat b_R(\theta) = \frac{1}{R}\sum_{r=1}^{R} \hat\lambda_T^{(r)}(\theta) .$$

If $b(\theta)$ is replaced by $\hat b_R(\theta)$ we can construct a
simulated minimum distance estimator as

$$\hat\theta^{R}_{MD} = \arg\min_{\theta}\,[\hat\lambda_T - \hat b_R(\theta)]'\,A\,[\hat\lambda_T - \hat b_R(\theta)] , \tag{10-10}$$

where $A$ is a positive definite weight matrix. This indirect
inference estimator suggested by Gouriéroux, Monfort and Renault
[1993] searches for a value of $\theta$ for which simulated data from
the structural model approximate the properties of the observed data
summarized by the estimate $\hat\lambda_T$ as closely as possible.
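
The parameter based estimator can be sketched with a textbook-style
toy problem: a structural MA(1) model
$y_t = \varepsilon_t + \theta\varepsilon_{t-1}$ paired with an AR(1)
auxiliary model estimated by least squares. This pairing, the
restriction of $\theta$ to a grid on $[0, 0.95]$ and the scalar weight
$A = 1$ are assumptions of the sketch, not taken from the chapter; the
MA(1) model is of course tractable and serves only as a check.

```python
import numpy as np


def simulate_ma1(theta, eps):
    """Structural model (illustrative): y_t = eps_t + theta * eps_{t-1}; eps has shape (..., T + 1)."""
    return eps[..., 1:] + theta * eps[..., :-1]


def auxiliary_estimate(y):
    """Quasi-ML (here OLS) estimate of lambda in the auxiliary AR(1) y_t = lambda * y_{t-1} + error."""
    return np.sum(y[..., 1:] * y[..., :-1], axis=-1) / np.sum(y[..., :-1] ** 2, axis=-1)


def binding_function_hat(theta, eps_sim):
    """Average of the auxiliary estimates over the R simulated paths."""
    return np.mean(auxiliary_estimate(simulate_ma1(theta, eps_sim)))


rng = np.random.default_rng(4)
T, R, theta0 = 2000, 10, 0.5
y_obs = simulate_ma1(theta0, rng.standard_normal(T + 1))   # artificial "observed" data
lam_hat = auxiliary_estimate(y_obs)
eps_sim = rng.standard_normal((R, T + 1))                  # drawn once, reused for every theta
grid = np.linspace(0.0, 0.95, 191)
distance = [(lam_hat - binding_function_hat(th, eps_sim)) ** 2 for th in grid]  # A = 1
theta_ii = grid[int(np.argmin(distance))]
```

In this example the binding function is known to be
$b(\theta) = \theta/(1+\theta^2)$, so `lam_hat` should lie near 0.4
and `theta_ii` near 0.5. Note that every candidate $\theta$ requires
$R$ re-estimations of the auxiliary model, which are cheap here only
because the auxiliary estimator has a closed form.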


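As the sample size $T$ goes to infinity, the indirect inference
estimator is consistent and asymptotically normal for any fixed
$R \ge 1$. The asymptotic optimal weight matrix is given by

$$A_0 = J_0 I_0^{-1} J_0 ,$$

where

$$J_0 = \lim_{T\to\infty} E\left\{-\frac{\partial^2 Q(Y,X;\lambda_0)}{\partial\lambda\,\partial\lambda'}\right\}$$

$$I_0 = \lim_{T\to\infty} \mathrm{var}\left\{\sqrt{T}\,g(Y,X;\lambda_0) - E[\sqrt{T}\,g(Y,X;\lambda_0)|X]\right\} .$$

For this optimal choice of the weight matrix the asymptotic
distribution of the minimum distance estimator (10-10) is obtained as
$T^{1/2}(\hat\theta^{R}_{MD} - \theta_0) \xrightarrow{d} N(0, \mathrm{avar}(\hat\theta^{R}_{MD}))$,
where the asymptotic variance of $\hat\theta^{R}_{MD}$ is given by
$\mathrm{avar}(\hat\theta^{R}_{MD}) = [1 + (1/R)][B'A_0B]^{-1}$ with
$B = \partial b(\theta_0)/\partial\theta'$ (see Gouriéroux, Monfort and
Renault [1993]).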


The second approach for deriving an indirect estimate from the
auxiliary model, suggested by Gallant and Tauchen [1996a], is to use
the moment conditions implied by the scores of the auxiliary model

$$E_{\theta_0}\,g[Y,X;b(\theta_0)] = 0 . \tag{10-11}$$

Using path simulations from the structural model to approximate
$E_\theta\,g[Y,X;b(\theta)]$, the GMM estimation procedure based on the
scores of the auxiliary model results as

$$\hat\theta^{R}_{SC} = \arg\min_{\theta}\,G_R(\theta,\hat\lambda_T)'\,A\,G_R(\theta,\hat\lambda_T) , \tag{10-12}$$

where

$$G_R(\theta,\hat\lambda_T) = \frac{1}{R}\sum_{r=1}^{R}\frac{1}{T}\sum_{t=1}^{T} \frac{\partial \log h^*[y_t^{(r)}(\theta)|z_t^{(r)}(\theta);\hat\lambda_T]}{\partial\lambda} \tag{10-13}$$

is the simulated score function which approximates the moment
conditions (10-11) and $A$ is a positive definite weight matrix. For
this estimator the asymptotic optimal weight matrix is given by
$A_0 = I_0^{-1}$. Notice that the score vector (10-9) for the observed
data and the estimate $\hat\lambda_T$ is equal to zero as implied by
the first order condition. Hence, the estimator $\hat\theta^{R}_{SC}$
searches for a value of $\theta$ for which the simulated data from the
structural model mimic this first order condition.
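
The score based variant can be sketched on a toy problem: a structural
MA(1) model $y_t = \varepsilon_t + \theta\varepsilon_{t-1}$ with an
AR(1) auxiliary model whose Gaussian quasi-log-likelihood has unit
error variance. The model pair, the grid for $\theta$ and the scalar
weight $A = 1$ are assumptions of the sketch and are not taken from
the chapter.

```python
import numpy as np


def simulate_ma1(theta, eps):
    """Structural model (illustrative): y_t = eps_t + theta * eps_{t-1}; eps has shape (..., T + 1)."""
    return eps[..., 1:] + theta * eps[..., :-1]


def auxiliary_score(y, lam):
    """Score of the Gaussian AR(1) quasi-log-likelihood (unit error variance) w.r.t. lambda,
    averaged over t and, for a 2-D input, over the simulated paths as well."""
    return np.mean((y[..., 1:] - lam * y[..., :-1]) * y[..., :-1])


rng = np.random.default_rng(5)
T, R, theta0 = 2000, 10, 0.5
y_obs = simulate_ma1(theta0, rng.standard_normal(T + 1))              # artificial "observed" data
lam_hat = np.sum(y_obs[1:] * y_obs[:-1]) / np.sum(y_obs[:-1] ** 2)    # the only auxiliary estimation
eps_sim = rng.standard_normal((R, T + 1))                             # reused for every theta
grid = np.linspace(0.0, 0.95, 191)
criterion = [auxiliary_score(simulate_ma1(th, eps_sim), lam_hat) ** 2 for th in grid]  # A = 1
theta_sc = grid[int(np.argmin(criterion))]
```

The auxiliary model is estimated exactly once, on the observed data;
for each candidate $\theta$ only the closed-form score has to be
evaluated on the simulated paths.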


Both estimators $\hat\theta^{R}_{MD}$ and $\hat\theta^{R}_{SC}$ are
derived from similar principles although the criterion function is
different. Indeed, Gouriéroux, Monfort and Renault [1993] show that
both approaches yield asymptotically equivalent estimators as $T$ goes
to infinity. Thus, the choice between these estimators is a matter of
computational convenience. In this respect, the following should be
considered. As usual for nonlinear optimization problems, estimations
based on $\hat\theta^{R}_{MD}$ and $\hat\theta^{R}_{SC}$ are performed
with iterative optimization algorithms. However, at every iteration
step of the optimization with respect to $\theta$, the parameter based
estimator $\hat\theta^{R}_{MD}$ requires "secondary" optimizations to
estimate the auxiliary parameters $\lambda$, whereas the score based
estimator $\hat\theta^{R}_{SC}$ requires only one optimization
concerning $\lambda$. Furthermore, the estimator
$\hat\theta^{R}_{MD}$, using the optimal weight matrix $A_0$, requires
an estimate of $J_0$ based on the Hessian matrix, which is not
necessary for the estimator $\hat\theta^{R}_{SC}$. On the other hand,
for the computational efficiency of the score based estimator
$\hat\theta^{R}_{SC}$, it is necessary that the score vector of the
auxiliary model (10-9) is available in a closed form, which is not
essential for the parameter based estimator.


The asymptotic efficiency of the indirect inference estimators
depends on the potential of the auxiliary model to approximate the
true process. In fact, if
$h(y_t|z_t;\theta) = h^*(y_t|z_t;b(\theta))$ for all $\theta$ in some
neighborhood of $\theta_0$, the structural model is "smoothly embedded
within the score generator" (see Gallant and Tauchen [1996a]), and it
follows that the indirect inference estimator is asymptotically
efficient. In principle, two different approaches to select an
appropriate auxiliary model (or score generator) exist. The first
approach is to search for an auxiliary model that is able to mimic the
salient features of the structural model, and is as close to it as
possible. For the SV model (see Example 10.1), for instance, a
potential candidate may be a GARCH specification as the predictions
concerning the stochastic behavior of the returns resulting from a
GARCH model and the SV model are very similar. The second approach, as
advocated by Gallant and Tauchen [1996a], is a data dependent choice
of the auxiliary model. Specifically, they propose to adopt a
flexible, possibly nonparametric, score generator which can be
expected to capture any dynamic and distributional features of the
observed data. Such a data dependent procedure associated with the
term EMM is considered in the next section in greater detail.


10.4 The SNP Approach 


To achieve a high level of efficiency for the indirect inference esti- 
mator, Gallant and Tauchen [1996a] consider the class of semi- 
nonparametric (SNP) models of Gallant and Nychka [1987] for 
constructing the score generator. As shown by Gallant and Long 
[1997], these SNP models can be expected to capture the proba- 
bilistic structure of any stationary and Markovian time series. 


The SNP model as applied by Gallant and Tauchen [1996b], Andersen and
Lund [1997], and Gallant, Hsieh and Tauchen [1997] to various
financial time series can be represented by the following conditional
density:

$$h_q^*(y_t|z_t;\lambda_q) = \frac{[P(u_t,z_t)]^2\,\phi(u_t)/|\det(S_t)|}{\int [P(v,z_t)]^2\,\phi(v)\,dv} . \tag{10-14}$$

Here $z_t = [y_{t-1}',\dots,y_{t-L}']'$ and $\lambda_q$ is a
$q$-dimensional parameter vector. The $n$-dimensional vector $u_t$ is
obtained from a standardization of $y_t$, i.e.,
$u_t = S_t^{-1}(y_t - \mu_t)$, where $\mu_t$ and $S_t$ are a location
and a scale function, respectively. The density function of a
multivariate normal distribution with mean zero and unit covariance
matrix is denoted by $\phi(\cdot)$, and $P(u_t,z_t)$ is a polynomial in
$u_t$ with coefficients depending on $z_t$. The integration constant
$\int [P(v,z_t)]^2\,\phi(v)\,dv$ ensures that
$h_q^*(y_t|z_t;\lambda_q)$ integrates to unity.


The parametrizations of the location function, the scale func- 
tion, and the polynomial are as follows. To accommodate the dy- 
namic structure in the mean, the location function is the conditional 
mean corresponding to a vector autoregression given by 

lu 
pi = bo + X Biyi- - (10-15) 
i=l 
To capture the dynamics in the variance, the following ARCH-type 
scale function is applied 


$$ \operatorname{vech}(S_t) = c_0 + \sum_{i=1}^{l_s} C_i \, |y_{t-i} - \mu_{t-i}| , $$ (10-16)

where $\operatorname{vech}(S_t)$ is the vector containing the $n(n+1)/2$
distinct elements of $S_t$ and $|y_{t-i} - \mu_{t-i}|$ indicates the
elementwise absolute value. Alternative scale functions applied by
Andersen and Lund [1997] and Andersen, Chung and Sørensen [1998] are
based on corresponding GARCH-type specifications. In order to account
for non-Gaussianity and dynamic dependencies of the standardized
process $u_t$, the normal density $\phi(\cdot)$ is expanded using the
square of the polynomial


$$ P(u_t, z_t) = \sum_{|\alpha|=0}^{k_u} a_\alpha(z_t) \, u_t^\alpha , $$ (10-17)

where $u_t^\alpha = \prod_{i=1}^{n} u_{it}^{\alpha_i}$ and
$|\alpha| = \sum_{i=1}^{n} |\alpha_i|$. The parameter $k_u$ denotes the
degree of the polynomial and controls the extent to which
$h_q(y_t \mid z_t; \lambda_q)$ deviates from the normal density. For
$k_u = 0$ the density function $h_q(y_t \mid z_t; \lambda_q)$ reduces to
that of a normal distribution. To achieve identification, the constant
term of the polynomial is set to 1. To allow for deviations from
normality to depend on past values of $y_t$, the coefficients
$a_\alpha(z_t)$ are polynomials in $z_t$ given by


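$$ a_\alpha(z_t) = \sum_{|\beta|=0}^{k_z} a_{\alpha\beta} \, z_t^\beta , $$

where $z_t^\beta = \prod_i z_{it}^{\beta_i}$ and $|\beta| = \sum_i |\beta_i|$,
with $i$ running over the elements of $z_t$. For $k_z = 0$ the deviations
from the shape of a normal density are independent of $z_t$.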


Summing up, the leading term of the SNP model, obtained for
$k_u = k_z = 0$, is a Gaussian VAR-ARCH specification depending
on the lag lengths $l_\mu$ and $l_s$. This leading term captures the
heterogeneity in the first two moments. The remaining features
of the data such as any remaining non-normality and possible
heterogeneity in the higher-order moments are accommodated by
an expansion of the squared Hermite polynomial
$[P(u_t, z_t)]^2 \phi(u_t)$ controlled by $k_u$ and $k_z$. To estimate the
parameter vector $\lambda_q$, whose dimension is determined by $l_\mu$,
$l_s$, $k_u$, and $k_z$, the ML method can be used. For this purpose, the
integration constant of the SNP model (10-14) can be computed
analytically by applying the recursive formulas for the moments of a
standard normal distribution (see, e.g., Patel and Read [1982]).
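

To make the role of the integration constant concrete, the following minimal Python sketch evaluates a univariate SNP density of the form (10-14) for the special case $k_z = 0$, so that the polynomial coefficients are constants. The function names, the example coefficients and the restriction to one dimension are assumptions made for illustration only; they are not taken from the SNP literature or software.

```python
import math

def normal_moment(k):
    """E[v^k] for v ~ N(0, 1): zero for odd k, (k-1)!! for even k."""
    if k % 2 == 1:
        return 0.0
    m = 1.0
    for j in range(1, k, 2):          # 1 * 3 * ... * (k - 1)
        m *= j
    return m

def snp_density(y, mu, s, a):
    """Univariate SNP density with constant coefficients a[0], ..., a[k_u].

    mu and s > 0 stand for the values of the location and scale functions
    at time t; a[0] is the constant term (set to 1 for identification).
    """
    u = (y - mu) / s
    poly = sum(a_j * u ** j for j, a_j in enumerate(a))
    phi = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    # Integration constant E[P(v)^2] = sum_{i,j} a_i a_j E[v^(i+j)],
    # obtained analytically from the moments of the standard normal.
    const = sum(a_i * a_j * normal_moment(i + j)
                for i, a_i in enumerate(a) for j, a_j in enumerate(a))
    return poly ** 2 * phi / (s * const)
```

With `a = [1.0]` the function returns the normal density; higher-order coefficients bend the shape while the analytic constant keeps the total mass at one.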


If the dimension of the SNP model $q$ increases with the sample
size $T$, the Quasi-ML estimate of the SNP model
$h_q(y_t \mid z_t; \lambda_q)$ is, under weak conditions, an efficient
nonparametric estimate of the true density $h_0(y_t \mid z_t)$ (see
Fenton and Gallant [1996a,b]). Furthermore, Gallant and Long [1997]
show that the indirect inference estimator with the SNP model as the
score generator (or EMM estimator) attains the asymptotic efficiency
of the ML estimator by increasing the dimension $q$. However, how to
determine the adequate specification of the SNP model, i.e., how to
select $l_\mu$, $l_s$, $k_u$ and $k_z$, remains a difficult problem. In
most practical applications (see, e.g., Gallant, Rossi and Tauchen
[1992], Gallant and Tauchen [1996b] and Tauchen [1997]) the dimension
$q$ of the SNP model is successively expanded and the AIC or BIC model
selection criteria are used to determine a preferred specification.
Then, in order to check the adequacy of the chosen specification,
diagnostic tests based on the standardized residuals are conducted.
An alternative approach to checking the adequacy of the chosen SNP
specification is followed by Liu and Zhang [1998], who propose an
overall goodness of fit test for the auxiliary model based on the
partial sum process of the score series of this auxiliary model.


10.5 Some Practical Issues 


In many cases the application of simulation techniques requires an
immense amount of computer power and thus some care is necessary
when implementing the simulation procedures. In this section we 
therefore address some practical problems and report implications 
of recent Monte Carlo studies concerning the properties of simula- 
tion based estimators. 


10.5.1 Drawing Random Numbers and Variance Reduction


In most applications the simulation based estimator is obtained 
by optimizing the criterion function using an iterative algorithm. 
At each iteration step the criterion function must be estimated 
through simulations for the current parameter values. For such 
an algorithm to converge, it is important to use common random 
numbers at every iteration step. With regard to the reduced form
of the model, which generates $y_t$ from the error terms
$\varepsilon_t$ for a given value of $\theta$, the use of common random
numbers means that for every value of $\theta$ during the iterative
optimization procedure, the same set of simulated random variables
$\{\tilde{\varepsilon}_t\}$ is used to generate simulated values of
$y_t$. If at each iteration step new values of $\varepsilon_t$ were
drawn, some additional randomness would be introduced and the
algorithm would fail to converge (see, e.g., Hendry [1984]).
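

The point about common random numbers can be sketched with a toy example in Python. The reduced form $y_t = \theta + \varepsilon_t$, the target moment, the sample size and all function names below are hypothetical choices made only for illustration. The criterion built on a fixed set of shocks is a deterministic function of $\theta$ and can be minimised by a simple search, whereas the criterion that redraws the shocks returns a different value each time it is evaluated at the same $\theta$.

```python
import random

def simulate(theta, shocks):
    # Toy reduced form (an assumption for illustration): y_t = theta + eps_t.
    return [theta + e for e in shocks]

def criterion(theta, target_mean, shocks):
    # Squared distance between an observed moment and its simulated counterpart.
    sim = simulate(theta, shocks)
    return (target_mean - sum(sim) / len(sim)) ** 2

def ternary_search(f, lo, hi, tol=1e-8):
    # Simple derivative-free minimiser for a unimodal function on [lo, hi].
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

rng = random.Random(12345)
TARGET = 2.0                                   # hypothetical observed sample mean
common_shocks = [rng.gauss(0.0, 1.0) for _ in range(500)]   # drawn once, reused

def q_common(theta):
    # Same shocks at every evaluation: a smooth, repeatable criterion.
    return criterion(theta, TARGET, common_shocks)

def q_fresh(theta):
    # Redraws the shocks at every call, so the criterion itself is noisy.
    fresh = [rng.gauss(0.0, 1.0) for _ in range(500)]
    return criterion(theta, TARGET, fresh)

theta_hat = ternary_search(q_common, 0.0, 4.0)
```

Applying the same search to `q_fresh` would compare function values contaminated by fresh simulation noise, which is the source of the convergence failure described in the text.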


As shown above, the overall variance of simulation based es-
timators consists of two components. The first one represents the
variance of the estimator as if it were based on the exact criterion
function and the second one is the Monte Carlo sampling variance
due to the use of simulations to evaluate the criterion function.
The first component is irreducible whereas the second component
can be made arbitrarily small by increasing the simulation sample
size. Unfortunately, this often leads to an enormous increase in
computing costs. However, there exist a number of techniques
developed for reducing the Monte Carlo sampling variance without
increasing the computing costs, for instance, the antithetic variates
and control variates procedures.


The idea of the antithetic variates procedure as applied, for
example, by Andersen and Lund [1997] for an indirect inference
estimator is as follows. If we want to estimate a quantity $\omega$ by
simulations, here for example, the moment conditions (10-11), we
construct two estimates for these moment conditions according to
estimator (10-13), say $\hat{\omega}_1$ and $\hat{\omega}_2$, that are
negatively correlated. Then the average
$\frac{1}{2}(\hat{\omega}_1 + \hat{\omega}_2)$ has a lower variance than
either of the two individual estimates. Assuming that the error term
$\varepsilon_t$ in the reduced form of the model has a symmetric
distribution around zero, negatively correlated estimates of the
moment conditions $\omega$ can be produced by using a set of simulated
values $\{\tilde{\varepsilon}_t\}$ for $\hat{\omega}_1$ and the same set
of simulated values but with the opposite sign, i.e.,
$\{-\tilde{\varepsilon}_t\}$, for $\hat{\omega}_2$. The additional
computing costs of this procedure are negligible and the reduction of
the Monte Carlo sampling variance may be considerable as reported by
Andersen and Lund [1997].
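

A small Python experiment illustrates the mechanism. The quantity being estimated, $E[\exp(\varepsilon)]$ with standard normal $\varepsilon$, is a hypothetical stand-in for a simulated moment condition, and the function names, seed and sample sizes are likewise assumptions chosen for illustration; the sketch does not reproduce the setup of Andersen and Lund [1997].

```python
import math
import random
import statistics

def moment_estimate(shocks):
    # Natural Monte Carlo estimate of omega = E[exp(eps)] from one set of shocks.
    return sum(math.exp(e) for e in shocks) / len(shocks)

def antithetic_estimate(shocks):
    # Second estimate reuses the same shocks with the opposite sign;
    # for a symmetric shock distribution both estimates are unbiased.
    w1 = moment_estimate(shocks)
    w2 = moment_estimate([-e for e in shocks])
    return 0.5 * (w1 + w2)

rng = random.Random(2024)
plain, anti = [], []
for _ in range(2000):                          # Monte Carlo replications
    shocks = [rng.gauss(0.0, 1.0) for _ in range(50)]
    plain.append(moment_estimate(shocks))
    anti.append(antithetic_estimate(shocks))

var_plain = statistics.variance(plain)         # variance of a single estimate
var_anti = statistics.variance(anti)           # variance of the antithetic average
```

Here the two estimates are negatively correlated because $\exp(\varepsilon)$ is monotone in $\varepsilon$, so `var_anti` comes out well below `var_plain` although no additional random numbers are drawn.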


The control variates technique, as applied by Calzolari, Di Iorio
and Fiorentini [1998] for indirect inference, uses two components
for the final Monte Carlo estimate of the quantity of interest
$\omega$. The first component is the natural Monte Carlo estimate of
$\omega$ denoted by $\hat{\omega}^*$, and the second one is an estimate
$\tilde{\omega}$ created from the same set of simulated random numbers
as $\hat{\omega}^*$ with known expectation and a positive correlation
with $\hat{\omega}^*$. Then the final estimate of $\omega$ based on the
control variate $\tilde{\omega}$ is given by
$\hat{\omega} = (\hat{\omega}^* - \tilde{\omega}) + E(\tilde{\omega})$.
Under suitable conditions, the variance of $\hat{\omega}$ is
considerably smaller than that of the natural estimator
$\hat{\omega}^*$. Calzolari, Di Iorio and Fiorentini [1998] adjust the
parameter based indirect inference estimator by control variates
created from the difference $(\tilde{\lambda} - \hat{\lambda}_T)$, where
$\hat{\lambda}_T$ is the estimate of the auxiliary parameter $\lambda$
based on the observed data and $\tilde{\lambda}$ is an estimate of
$\lambda$ using simulated data from the auxiliary model. These
simulated data are generated using $\hat{\lambda}_T$ as the parameter
vector and the same set of simulated random numbers as for the
indirect inference procedure itself. Based on Monte Carlo
experiments, they show that the indirect inference estimator combined
with control variates and applied to continuous-time models (see
Example 10.2) reduces the Monte Carlo sampling variance substantially
relative to the simple indirect inference estimator.
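

The generic formula for a control variate estimate can be tried out in a few lines of Python. The example below is a hypothetical toy problem, not the procedure of Calzolari, Di Iorio and Fiorentini [1998]: the quantity of interest is $E[\exp(\varepsilon)]$ for standard normal $\varepsilon$, and the control variate is the sample mean of the same shocks, whose expectation is known to be zero and which is positively correlated with the natural estimate. All names, the seed and the sample sizes are assumptions.

```python
import math
import random
import statistics

def natural_estimate(shocks):
    # Natural Monte Carlo estimate of omega = E[exp(eps)].
    return sum(math.exp(e) for e in shocks) / len(shocks)

def control_variate_estimate(shocks):
    # Control variate: mean of the same shocks, with known expectation 0.
    w_star = natural_estimate(shocks)
    w_tilde = sum(shocks) / len(shocks)
    expected_w_tilde = 0.0
    return (w_star - w_tilde) + expected_w_tilde

rng = random.Random(7)
plain, adjusted = [], []
for _ in range(2000):                          # Monte Carlo replications
    shocks = [rng.gauss(0.0, 1.0) for _ in range(50)]
    plain.append(natural_estimate(shocks))
    adjusted.append(control_variate_estimate(shocks))

var_plain = statistics.variance(plain)
var_cv = statistics.variance(adjusted)
```

The adjustment leaves the expectation unchanged and removes the part of the simulation noise that the natural estimate shares with the control variate, so `var_cv` is noticeably smaller than `var_plain` in this example.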


10.5.2 The Selection of the Auxiliary Model 


Indirect inference has been applied to a variety of financial time 
series models. Next, we discuss strategies used to select an auxiliary 
model (or score generator). 


A data dependent choice of the auxiliary model based on an
expansion of the SNP model (10-14) has been followed by Gallant
and Tauchen [1996b], Tauchen [1997] and Andersen and Lund
[1997] to estimate continuous-time models for interest rates, such as
the Cox-Ingersoll-Ross and Chan-Karolyi-Longstaff-Sanders
specifications (see Example 10.2). The same approach is used by
Gallant, Hsieh and Tauchen [1997] for the estimation of discrete-time
SV models (see Example 10.1) for interest rates, stock returns and
exchange rates. In these applications the dimension $q$ of the SNP
auxiliary model, determined by model selection criteria, is typically
quite large, resulting in a multitude of auxiliary parameters and
hence in a large number of moments. It turns out that an expansion
of the scale function such as that in equation (10-16) is necessary
to accommodate the typically observed conditional heteroskedasticity
of financial time series. The expansion of the polynomial (10-17)
is important to capture, for instance, the leptokurtic distribution
of financial time series not accommodated by a time varying scale
function and possible asymmetries of this distribution.


Simpler auxiliary models, which are close to the structural
model, resulting in a similar number of auxiliary and structural
parameters, are chosen by Broze, Scaillet and Zakoian [1995]
and Engle and Lee [1996]. To estimate the Cox-Ingersoll-Ross
and Chan-Karolyi-Longstaff-Sanders specifications for interest rates,
Broze, Scaillet and Zakoian [1995] use auxiliary models based on
simple discrete-time Euler approximations of the corresponding
continuous-time model. Engle and Lee [1996] apply GARCH
specifications as auxiliary models to estimate continuous-time SV
models for exchange rates, interest rates and stock returns.


However, the data dependent SNP approach to select an aux- 
iliary model is motivated by asymptotic arguments indicating that 
this approach ensures a high level of efficiency of the indirect in- 
ference estimator when the maintained structural model is true. 
Clearly, if the structural model is true, a simple auxiliary model 
very close to it in the sense that it reflects all salient features of 
the structural model can also be expected to ensure a high level of 
efficiency. Nevertheless, the data dependent SNP approach seems 
to be more adequate if we are interested in detecting possible 
misspecifications of the structural model based on corresponding 
specification tests (not discussed here). 


10.5.3 Small Sample Properties of the Indirect Inference


As we saw, the theory of the indirect inference estimator, as de-
veloped by Gouriéroux, Monfort and Renault [1993], Gallant and
Tauchen [1996a] and Gallant and Long [1997], is based on asymptotic
arguments. This raises the question of its finite sample
properties. A comprehensive Monte Carlo study of the performance
of EMM in finite samples is conducted by Andersen, Chung and
Sørensen [1998]. They use the stochastic volatility model (see
Example 10.1) to compare EMM with GMM and likelihood based
estimators and to address the adequate parametrization of the
auxiliary model. Their key findings are that EMM provides, regardless
of the sample size, a substantial efficiency gain relative to the
standard GMM procedure. Furthermore, the likelihood based estimators
are generally more efficient than the EMM procedure, but EMM
approaches the efficiency of the likelihood based estimators with
increasing sample size, in harmony with the asymptotic theory of
the EMM estimator. Finally, they find evidence that score generators
based on an over-parametrized SNP model lead, especially
in smaller samples, to a substantial loss of efficiency. Specifically,
they show that the substitution of an ARCH type scale function
in the SNP model as given in equation (10-16) by a GARCH type
specification improves the efficiency of the EMM estimator. In fact,
this substitution reduces the number of parameters necessary to
capture the autocorrelation in the variance.


3 For specification tests based on indirect inference, see e.g., Gouriéroux,
Monfort and Renault [1993], Tauchen [1997] and Gallant, Hsieh and Tauchen
[1997].


10.6 Conclusion 


Recently, simulation based inference procedures have become popu- 
lar in particular in empirical finance. This is due to the complexity 
of the standard models implied by latent factors or continuous-time 
processes, for example. This chapter reviewed different approaches 
for the estimation of the parameters of interest based on a GMM 
criterion function. The MSM approach is the simulated counter- 
part of the traditional GMM procedure and is applicable when the 
theoretical moments cannot be computed analytically. However, 
in many applications it is not clear how to choose the moment 
conditions. In nonlinear models the structure implies restrictions 
on a wide range of moments and, therefore, it is difficult to represent 
the main features of the model using a few moment conditions. 
In such cases it seems attractive to use a simple auxiliary model 
which approximates the main features of the structural model. In 
most cases, however, the relationship between the parameters of the 
auxiliary model and the parameters of interest is too complicated to 
admit an explicit solution. Hence, simulation techniques are applied 
to evaluate the binding function linking the parameters of interest 
with the parameters of the auxiliary model. Two asymptotically 
equivalent approaches for such indirect inference are available.
Gouriéroux, Monfort and Renault [1993] use a minimum distance
procedure whereas Gallant and Tauchen [1996a] use the scores of
the auxiliary model as the moment condition to be matched by a
(simulation based) GMM procedure.


Since the efficiency of an indirect inference procedure crucially 
depends on the potential of the auxiliary model to approximate the 
model of interest, it seems attractive to use flexible nonparametric 
models as score generators. Such estimation procedures are known 
as EMM estimators in the literature and seem to be a fruitful and
promising field of future research.


Chapter 11


LOGICALLY INCONSISTENT LIMITED DEPENDENT VARIABLES MODELS


J. S. Butler and Gabriel Picone 


Simultaneous equations models involving limited dependent vari- 
ables can have nonunique reduced forms, a problem called logical 
inconsistency in the econometrics literature. In response to that 
problem, such models can be compelled to be recursive (Maddala 
[1983], Amemiya [1985]) or recast in terms of the latent variables 
(Mallar [1977]). In labor economics and elsewhere, this approach is 
often contrary to structural modelling; theory involving education, 
childbearing, and work, for example, naturally leads to models with 
simultaneously related limited dependent variables. Restricting 
these models to be recursive is inconsistent with the theory. 


It is widely believed among economists that logically incon- 
sistent models cannot be data generating processes (see Amemiya 
[1974] and Maddala [1983]). However, Jovanovic [1989] showed 
that the structural form of a model with nonunique reduced forms 
can be identified. That raises the possibility that these models can 
produce outcomes which are random variables even if the process 
is logically inconsistent. 


An alternative interpretation of these models is that they 
can generate more than one equilibrium for the endogenous vari- 
ables for some values of the exogenous variables and disturbances. 
Viewed this way, the problem can be solved by using a selection
rule (Dagsvik and Jovanovic [1991], Goldfeld and Quandt
[1968], Hamilton and Whiteman [1985]) or collapsing the possibly
nonunique equilibria into one outcome for purposes of estimation
(Bresnahan and Reiss [1991]).


This chapter combines the use of a selection rule to choose 
among alternative equilibria with the insights of Jovanovic [1989] 
and standard GMM estimation theory to suggest an alternative to 
the method of Bresnahan and Reiss [1991] to identify and estimate 
simultaneous equations models of limited dependent variables. Sec- 
tion 11.1 shows a logically inconsistent model that describes a data 
generating process. Section 11.2 shows that the model is uniquely 
identified. Section 11.3 argues that by standard GMM estimation 
theory the model can be estimated independent of the selection 
rule. Section 11.4 concludes the chapter and discusses extensions 
of the results. 


11.1 Logical Inconsistency 


Consider the following model of two simultaneous equations in 
endogenous dummy variables: 


$$
\begin{aligned}
y_1^* &= \gamma_{21} y_2 + x_1 \beta_1 + u_1 = \gamma_{21} y_2 + z_1 + u_1 , \\
y_2^* &= \gamma_{12} y_1 + x_2 \beta_2 + u_2 = \gamma_{12} y_1 + z_2 + u_2 , \\
y_1 &= 1 \text{ if } y_1^* > 0; \quad 0 \text{ otherwise}; \\
y_2 &= 1 \text{ if } y_2^* > 0; \quad 0 \text{ otherwise}.
\end{aligned}
$$ (11-1)


Only $y_1$, $y_2$, $x_1$, and $x_2$ are observed. This model could be
widely used in labor economics and elsewhere. Bresnahan and Reiss
[1991] showed that this model can be derived as the result of a
simultaneous move game, where the researcher observes only the
players' actions, but the players observe each other's payoffs,
providing one possible economic interpretation of the above model
when $y_1$ and $y_2$ represent the actions of two different players.
However, $y_1$ and $y_2$ may also represent the actions of only one
individual.


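Let $\gamma_{21}$ and $\gamma_{12}$ be positive. The reduced form
correspondence $\Psi_\phi(u)$, where $\phi$ indexes structural
relationships between observed and latent endogenous variables,
associated with model (11-1) is

$$
\begin{aligned}
y^1 = (1,1) \quad &\text{if } u \in A = \{u : (u_1, u_2) \in (-z_1, \infty) \times (-z_2, \infty)\} \\
&\qquad \cup \{u : (u_1, u_2) \in (-z_1, \infty) \times (-z_2 - \gamma_{12}, -z_2)\} \\
&\qquad \cup \{u : (u_1, u_2) \in (-z_1 - \gamma_{21}, -z_1) \times (-z_2, \infty)\}; \\
y^2 = (1,0) \quad &\text{if } u \in B = \{u : (u_1, u_2) \in (-z_1, \infty) \times (-\infty, -z_2 - \gamma_{12})\}; \\
y^3 = (0,1) \quad &\text{if } u \in C = \{u : (u_1, u_2) \in (-\infty, -z_1 - \gamma_{21}) \times (-z_2, \infty)\}; \\
y^4 = (0,0) \quad &\text{if } u \in D = \{u : (u_1, u_2) \in (-\infty, -z_1 - \gamma_{21}) \times (-\infty, -z_2 - \gamma_{12})\} \\
&\qquad \cup \{u : (u_1, u_2) \in (-\infty, -z_1 - \gamma_{21}) \times (-z_2 - \gamma_{12}, -z_2)\} \\
&\qquad \cup \{u : (u_1, u_2) \in (-z_1 - \gamma_{21}, -z_1) \times (-\infty, -z_2 - \gamma_{12})\}; \text{ and} \\
y^1 \text{ or } y^4 \quad &\text{if } u \in E = \{u : (u_1, u_2) \in (-z_1 - \gamma_{21}, -z_1) \times (-z_2 - \gamma_{12}, -z_2)\}.
\end{aligned}
$$

Note that $A \cup B \cup C \cup D \cup E = \{u : (u_1, u_2) \in R^2\}$.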

The reduced form is a correspondence, and it would be a
function if the equilibrium were unique. Note that the multiple
equilibria arise in only one region, and in any other region there is
a unique equilibrium. Let $\Psi_1(u), \Psi_2(u) \in \Psi_\phi(u)$, where
$\Psi_1(u)$ is the function such that if $u \in E$ then
$\Psi_1(u) = y^1$, while $\Psi_2(u)$ is the function such that if
$u \in E$ then $\Psi_2(u) = y^4$. Then $\Psi_1(u)$ and $\Psi_2(u)$ are
identical unless $u \in E$.

In model (11-1), $\gamma_{12} > 0$ and $\gamma_{21} > 0$, but if both are
negative, there is also a range of values of $u_1$ and $u_2$ which
produces multiple equilibria. If $\gamma_{12}\gamma_{21} < 0$, there is a
set of values of $u_1$ and $u_2$ which produces no equilibrium. If
$\gamma_{12}\gamma_{21} = 0$, the model produces unique equilibria.
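

These cases can be checked by brute force. The following Python sketch enumerates the four candidate outcomes of model (11-1) for given disturbances and keeps those that are consistent with both equations. The function name `equilibria` and the numerical values used in the usage note are hypothetical and serve only as an illustration.

```python
from itertools import product

def equilibria(u1, u2, z1, z2, g21, g12):
    """Return every (y1, y2) in {0, 1}^2 that is consistent with model (11-1).

    g21 multiplies y2 in the first equation and g12 multiplies y1 in the
    second; z1 and z2 stand for the index values x1*beta1 and x2*beta2.
    """
    solutions = []
    for y1, y2 in product((0, 1), repeat=2):
        y1_star = g21 * y2 + z1 + u1
        y2_star = g12 * y1 + z2 + u2
        # An outcome is an equilibrium if each dummy matches its own latent index.
        if y1 == int(y1_star > 0) and y2 == int(y2_star > 0):
            solutions.append((y1, y2))
    return solutions
```

For instance, with `z1 = z2 = 0` and both coefficients equal to 1, a draw such as `u = (-0.5, -0.5)` lies in region $E$ and the function returns both $(0,0)$ and $(1,1)$; with coefficients of opposite sign some draws return an empty list.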


To solve the problem of nonuniqueness two solutions have been
proposed. One is to treat the events where the multiple equilibria
arise as one event (see Bresnahan and Reiss [1991]). In model
(11-1), $y^1$ and $y^4$ are then the same event. The other alternative
is to allow for a selection rule that assigns the nonunique region to
$y^1$ if $u \in F \subset E$ or to $y^4$ if $u \in (E - F)$. Then the
selected alternative can be written as:


$$ \Psi_F(u) = I_F(u) \Psi_1(u) + (1 - I_F(u)) \Psi_2(u) , $$ (11-2)

where $I_F(u)$ is an indicator function that takes the value 1 if
$u \in F$ and 0 otherwise. With the help of this selection rule we can
select any $\Psi_i \in \Psi_\phi$. If $u \in E^c$ then
$\Psi_i(u) = \Psi_1(u) = \Psi_2(u)$, and if $u \in E$ then we can
construct a set $F$ such that if $\Psi_i(u) = y^1$ then $u \in F$ and
if $\Psi_i(u) = y^4$ then $u \in (E - F)$. It follows that for each
$\Psi_i(u) \in \Psi_\phi(u)$ there exists a unique set $F$ such that
$\Psi_i(u) = \Psi_F(u)$. The use of a selection rule to deal with
econometric models with multiple equilibria was first proposed by
Goldfeld and Quandt [1968]. For more recent applications, see
Hamilton and Whiteman [1985] and Dagsvik and Jovanovic [1991].


The problem with models with multiple equilibria is that they 
cannot define a random variable for the dependent variable. For 
example, in model (11-1) the sum of the probabilities of the four
distinct outcomes does not add to one and model (11-1) cannot
completely define the random vector $y = (y_1, y_2)$ unless we impose
certain restrictions to preclude multiple solutions. See Maddala 
[1983, p. 119]. A necessary and sufficient condition for this model 
to have a unique solution is that the model be recursive. See 
Amemiya [1974]. This econometric assumption is often contrary 
to many theoretical structural models. Therefore, the above model 
has rarely been used. 


Thus, it is important to find the conditions under which any
function $\Psi_i(u) \in \Psi_\phi(u)$ generates a random variable. By
definition, a random variable is a real-valued measurable function. The
following proposition states that if there is a measurable selection 
rule to select between the two equilibria, model (11-1) generates a 
random variable. 


Proposition 11.1 


$\Psi_i(u) \in \Psi_\phi(u)$ is a random variable if $F$ is a measurable
set, where $F$ is the set such that $\Psi_i(u) = \Psi_F(u)$.


PROOF OF THE PROPOSITION 


Note $\Psi_i(u) : R^2 \to R^2$, so $\Psi_i(u)$ is measurable if and only
if each component of $\Psi_i$ is measurable (see Billingsley [1986,
p. 184]). $\Psi_i(u) = (\Psi_i^1(u), \Psi_i^2(u))$ where

$$
\Psi_i^1(u) = \begin{cases} 1 & \text{if } u \in H_1 = (A \cup B) \cup (E \cap F) \\ 0 & \text{if } u \in H_1^c = (C \cup D) \cup (E \cap F^c) \end{cases}
$$

and

$$
\Psi_i^2(u) = \begin{cases} 1 & \text{if } u \in H_2 = (A \cup C) \cup (E \cap F) \\ 0 & \text{if } u \in H_2^c = (B \cup D) \cup (E \cap F^c) \end{cases}
$$

$\Psi_i^1(u)$ and $\Psi_i^2(u)$ are characteristic functions which are
measurable if and only if $H_1$ and $H_2$ are measurable sets
(see Royden [1988]), but $A$, $B$, $C$, $D$, and $E$ are by
definition measurable sets. It follows that $H_1$ and $H_2$ are
measurable sets if and only if $F$ is a measurable set, since
the union and intersection of measurable sets are measurable.


11.2 Identification 


The identification problem in econometrics is the one of drawing 
inferences from the probability distribution of the observed vari- 
ables to an underlying theoretical structure. In this section we 
show that model (11-1) is identified in the sense of equation (11-2) 
of Jovanovic [1989]. 


Jovanovic defines structure to be a pair $s = (G, \phi)$, where $G(u)$
is a particular distribution function of the latent variable $u$, and
$\phi(u, y) = 0$ is a particular structural relationship between
observed and latent endogenous variables. He shows that if a model has
multiple equilibria then there is a set of distribution functions for
the endogenous variable consistent with a given structure $s$. Let
$\Gamma(s)$ be this set. Then a necessary and sufficient condition for
a model to be identified is:


$$ \Gamma(s) \cap \Gamma(s') = \emptyset \quad \forall \, s \neq s' . $$


In model (11-1) $\phi(u, y)$ can be characterized by the parameters
$\beta_1$, $\beta_2$, $\gamma_{21}$, and $\gamma_{12}$. The distribution
function for the latent variable $u$ can be any bivariate distribution
function $G(u)$ with support $R^2$. Without loss of generality assume
that it is any continuous distribution that can be characterized by
vector $\sigma$ of parameters. Then the structure of the model is
$s = (\beta, \gamma, \sigma)$.


The observed endogenous variable is a discrete bivariate random
vector that takes four different outcomes: $y^1$, $y^2$, $y^3$, and
$y^4$. Then its distribution function is described by a vector
$\mu_s$, such that

$$ \mu_s = [\Pr\nolimits_s(y^1), \Pr\nolimits_s(y^2), \Pr\nolimits_s(y^3), \Pr\nolimits_s(y^4)] $$

and

$$ \Pr\nolimits_s(y^1) + \Pr\nolimits_s(y^2) + \Pr\nolimits_s(y^3) + \Pr\nolimits_s(y^4) = 1 . $$

$\Pr_s(y^i)$, $i = 1, 2, 3, 4$, is a function of $s$, the structure.


By Proposition 11.1 every measurable selection from the
correspondence generates a particular distribution for the endogenous
variable. For example, let $\Psi_F(u)$ be the selected alternative;
then the associated distribution function for the endogenous variable
is

$$
\mu_F = \begin{bmatrix} P_s(A) + P_s(F) \\ P_s(B) \\ P_s(C) \\ P_s(D) + P_s(E) - P_s(F) \end{bmatrix} .
$$


Assuming that $P_s(E) \neq 0$, then $\mu_F$ can be written as:

$$
\mu_F = \frac{P_s(F)}{P_s(E)} \begin{bmatrix} P_s(A) + P_s(E) \\ P_s(B) \\ P_s(C) \\ P_s(D) \end{bmatrix}
+ \left( 1 - \frac{P_s(F)}{P_s(E)} \right) \begin{bmatrix} P_s(A) \\ P_s(B) \\ P_s(C) \\ P_s(D) + P_s(E) \end{bmatrix}
= \frac{P_s(F)}{P_s(E)} \, \mu_1 + \left( 1 - \frac{P_s(F)}{P_s(E)} \right) \mu_2 ,
$$ (11-3)

where $\mu_1$ and $\mu_2$ are the distribution functions associated with
$\Psi_1(u)$ and $\Psi_2(u)$ respectively. Thus, the set of distribution
functions consistent with the model, $\Gamma(s)$, is given by all the
$\mu_F$ defined as in equation (11-3) such that (a) $F$ is measurable
and (b) $F \subset E$. The following proposition proves that if $u$ is
an absolutely continuous random variable and $P_s(E) > 0$ then
$\Gamma(s)$ is given by all of the possible convex combinations of
$\mu_1$ and $\mu_2$.


Proposition 11.2 


If $u$ is an absolutely continuous random variable and
$P_s(E) > 0$, then for the endogenous variable consistent
with the structure given by model (11-1), the set of
distributions is given by


$$ \Gamma(s) = \{ \mu_\alpha : \mu_\alpha = \alpha \mu_1 + (1 - \alpha) \mu_2 , \; \alpha \in [0, 1] \} . $$


Assume that $P_s(E) = c > 0$. We need to show that

a) If $\Psi_F(u) \in \Psi_\phi(u)$ then $\mu_F = \alpha \mu_1 + (1 - \alpha) \mu_2$ where
$\alpha \in [0, 1]$.

b) If $\mu_\alpha = \alpha \mu_1 + (1 - \alpha) \mu_2$ for $\alpha \in [0, 1]$ then $\mu_\alpha \in \Gamma(s)$.

Part a) follows from equation (11-3), since the probability
distribution associated with any $\Psi_F \in \Psi_\phi$ can be
written as $\mu_F = (\Pr(F)/c) \mu_1 + (1 - \Pr(F)/c) \mu_2$, where
$c = \Pr(E)$. Because $F \subset E$, $\Pr(F) \leq \Pr(E) = c$. Thus
$0 \leq \Pr(F)/c = \alpha \leq 1$. Part b) follows from Jovanovic
[1986], Theorem 3.1, which says that if $u$ is an absolutely
continuous random variable and $\mu_1$ and $\mu_2$ belong to $\Gamma(s)$
then all of the convex combinations of $\mu_1$ and $\mu_2$ also belong
to $\Gamma(s)$.


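If $u$ is an absolutely continuous random vector with pdf $f$ then

$$
\int_{a_1}^{b_1} \int_{a_2}^{b_2} f(u_1, u_2) \, du_2 \, du_1 = G(b_1, b_2) - G(b_1, a_2) - G(a_1, b_2) + G(a_1, a_2)
$$

and the probabilities of the different regions can be written as

$$
\begin{aligned}
\Pr(A) &= \int_{-z_1}^{\infty} \int_{-z_2}^{\infty} f(u_1, u_2) \, du_2 \, du_1
+ \int_{-z_1}^{\infty} \int_{-z_2 - \gamma_{12}}^{-z_2} f(u_1, u_2) \, du_2 \, du_1
+ \int_{-z_1 - \gamma_{21}}^{-z_1} \int_{-z_2}^{\infty} f(u_1, u_2) \, du_2 \, du_1 \\
&= 1 - G(\infty, -z_2 - \gamma_{12}) + G(-z_1, -z_2 - \gamma_{12}) - G(-z_1, -z_2) \\
&\quad + G(-z_1 - \gamma_{21}, -z_2) - G(-z_1 - \gamma_{21}, \infty) , \\
\Pr(B) &= \int_{-z_1}^{\infty} \int_{-\infty}^{-z_2 - \gamma_{12}} f(u_1, u_2) \, du_2 \, du_1
= G(\infty, -z_2 - \gamma_{12}) - G(-z_1, -z_2 - \gamma_{12}) , \\
\Pr(C) &= \int_{-\infty}^{-z_1 - \gamma_{21}} \int_{-z_2}^{\infty} f(u_1, u_2) \, du_2 \, du_1
= G(-z_1 - \gamma_{21}, \infty) - G(-z_1 - \gamma_{21}, -z_2) , \\
\Pr(D) &= \int_{-\infty}^{-z_1 - \gamma_{21}} \int_{-\infty}^{-z_2 - \gamma_{12}} f(u_1, u_2) \, du_2 \, du_1
+ \int_{-\infty}^{-z_1 - \gamma_{21}} \int_{-z_2 - \gamma_{12}}^{-z_2} f(u_1, u_2) \, du_2 \, du_1
+ \int_{-z_1 - \gamma_{21}}^{-z_1} \int_{-\infty}^{-z_2 - \gamma_{12}} f(u_1, u_2) \, du_2 \, du_1 \\
&= G(-z_1, -z_2 - \gamma_{12}) - G(-z_1 - \gamma_{21}, -z_2 - \gamma_{12}) + G(-z_1 - \gamma_{21}, -z_2) , \\
\Pr(E) &= \int_{-z_1 - \gamma_{21}}^{-z_1} \int_{-z_2 - \gamma_{12}}^{-z_2} f(u_1, u_2) \, du_2 \, du_1 \\
&= G(-z_1, -z_2) - G(-z_1, -z_2 - \gamma_{12}) - G(-z_1 - \gamma_{21}, -z_2)
+ G(-z_1 - \gamma_{21}, -z_2 - \gamma_{12}) .
\end{aligned}
$$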
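Thus, the set of distribution functions associated with structure
$s = (\gamma, \sigma)$ is given by

$$
\Gamma(s) = \left\{ \mu_\alpha : \mu_\alpha =
\begin{bmatrix}
1 - G(\infty, -z_2 - \gamma_{12}) - G(-z_1 - \gamma_{21}, \infty) + \alpha \, G(-z_1 - \gamma_{21}, -z_2 - \gamma_{12}) \\
\quad + (1 - \alpha) \left( G(-z_1, -z_2 - \gamma_{12}) + G(-z_1 - \gamma_{21}, -z_2) - G(-z_1, -z_2) \right) \\[4pt]
G(\infty, -z_2 - \gamma_{12}) - G(-z_1, -z_2 - \gamma_{12}) \\[4pt]
G(-z_1 - \gamma_{21}, \infty) - G(-z_1 - \gamma_{21}, -z_2) \\[4pt]
\alpha \left( G(-z_1, -z_2 - \gamma_{12}) - G(-z_1 - \gamma_{21}, -z_2 - \gamma_{12}) + G(-z_1 - \gamma_{21}, -z_2) \right) \\
\quad + (1 - \alpha) \, G(-z_1, -z_2)
\end{bmatrix} , \; \alpha \in [0, 1] \right\} ,
$$

where the first two rows of the array together form the first component
of $\mu_\alpha$ and the last two rows together form its fourth component.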
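

As a numerical check on these expressions, the following Python sketch computes the probabilities of the regions $A$ to $E$ from the rectangle formula and assembles $\mu_\alpha$. It assumes, purely for illustration, that $u_1$ and $u_2$ are independent standard normal, so that $G(a, b) = \Phi(a)\Phi(b)$; this distributional choice, the function names and the parameter values in the usage note are not taken from the chapter.

```python
import math

def Phi(x):
    # Standard normal distribution function; math.erf handles +/- infinity.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def G(a, b):
    # Joint distribution function under the (assumed) independence of u1 and u2.
    return Phi(a) * Phi(b)

def rect(a1, b1, a2, b2):
    # P(a1 < u1 <= b1, a2 < u2 <= b2) via the rectangle formula.
    return G(b1, b2) - G(b1, a2) - G(a1, b2) + G(a1, a2)

def region_probs(z1, z2, g21, g12):
    inf = math.inf
    lo1, hi1 = -z1 - g21, -z1          # cut points for u1
    lo2, hi2 = -z2 - g12, -z2          # cut points for u2
    pA = (rect(hi1, inf, hi2, inf) + rect(hi1, inf, lo2, hi2)
          + rect(lo1, hi1, hi2, inf))
    pB = rect(hi1, inf, -inf, lo2)
    pC = rect(-inf, lo1, hi2, inf)
    pD = (rect(-inf, lo1, -inf, lo2) + rect(-inf, lo1, lo2, hi2)
          + rect(lo1, hi1, -inf, lo2))
    pE = rect(lo1, hi1, lo2, hi2)
    return pA, pB, pC, pD, pE

def mu_alpha(alpha, probs):
    # Convex combination alpha * mu_1 + (1 - alpha) * mu_2 of Proposition 11.2.
    pA, pB, pC, pD, pE = probs
    return [pA + alpha * pE, pB, pC, pD + (1.0 - alpha) * pE]
```

For any positive coefficients, for example `region_probs(0.3, -0.2, 0.8, 0.5)`, the five region probabilities add to one, and every `mu_alpha(alpha, probs)` with `alpha` in $[0, 1]$ is a proper distribution over the four outcomes whose second and third components do not depend on `alpha`.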
Now, consider a different structure $s' = (\gamma', \sigma)$ such that
$\gamma_{12} \neq \gamma'_{12}$, $\gamma_{21} \neq \gamma'_{21}$,
$\beta_1 \neq \beta'_1$, or $\beta_2 \neq \beta'_2$, but with the same
distribution function for $u$ that can be characterized by the same
vector of parameters $\sigma$. Then the model is uniquely identified
since

$$
\Pr\nolimits_s(B) = G(\infty, -z_2 - \gamma_{12}) - G(-z_1, -z_2 - \gamma_{12}) \neq
G(\infty, -z'_2 - \gamma'_{12}) - G(-z'_1, -z'_2 - \gamma'_{12}) = \Pr\nolimits_{s'}(B)
$$

or

$$
\Pr\nolimits_s(C) = G(-z_1 - \gamma_{21}, \infty) - G(-z_1 - \gamma_{21}, -z_2) \neq
G(-z'_1 - \gamma'_{21}, \infty) - G(-z'_1 - \gamma'_{21}, -z'_2) = \Pr\nolimits_{s'}(C)
$$

or both, except for exactly offsetting combinations of $z_1$ and
$\gamma_{21}$, or $z_2$ and $\gamma_{12}$, which have measure zero for
continuously distributed $x$ or require exact matches between discrete
values of $x$ and coefficients.


If a different structure $s'' = (\gamma', \sigma')$ implies also a
different vector of parameters $\sigma'$ then the model is not
identified and we need to impose further restrictions, such as
normalizing the variances or covariances, but this is a problem also
shared with logically consistent models.


The proofs presented here and the GMM estimation strategy
which follows from them apply directly to bivariate, binomial probit
and logit but could be adapted readily to tobit models, ordered
probit or logit models, and to combinations of those models with
regression, for example, a selection bias model with selection af-
fected by the outcome of the selected equation.


11.3 Estimation 


Model (11-1) can be estimated using the joint probability distrib- 
ution or the conditional probabilities of each outcome given the 
other. 


Model (11-1) can be estimated using the structural probabili- 
ties of the various outcomes if and only if the selection method is 
fully specified. Whether the likelihood function is used in MLE or 
the expectations of the proportions found in each outcome are used 
in GMM estimation, the result is inconsistent unless the selection 
method is correctly specified. 


Bresnahan and Reiss [1991] combine the equilibria which may
overlap, but we suggest an alternative method which does not
abandon the direct estimation using the above equations which
are so convenient given the familiarity of equation-by-equation
methods in econometrics and the usefulness of estimates of marginal
impacts in applied research. The alternative approach does not
require additional assumptions to be embodied in the estimation,
using only the conditional expectations of $y_1$ given $y_2$ and the
exogenous variables and of $y_2$ given $y_1$ and the exogenous
variables. For example, if $y_1 = 1$, then $y_2$ is determined by
$\gamma_{12} + x_2 \beta_2 + u_2$ regardless of whether $y_1 = 0$ is also
possible, or what selection rule is used to choose $y_1$. In the case
of linear probability models, which result from uniform marginal
distributions of the disturbances, the result is two-stage least
squares applied to the two (or more) equations. Other distributions
result in nonlinear equations which may be identified either with
exclusions or nonlinear functions of the instruments and estimated
with GMM. Butler and Mitchell [1990] apply these ideas to a bivariate
normal model.


To illustrate the estimation, assume model (11-1) with a probit
model specification, i.e., standard bivariate normally distributed
disturbances. Let $\{z\} = \{x_1\} \cup \{x_2\} \cup \{x_3\}$, where $x_3$
consists of nonlinear functions of $x_1$ and $x_2$, be the instrumental
variables. Identification by exclusions is usually less collinear and
more reliable when it is feasible. The orthogonality conditions based
on model (11-1) could be as follows.


$$
E \begin{bmatrix} z \left( y_1 - \Phi(\gamma_{21} y_2 + x_1 \beta_1) \right) \\ z \left( y_2 - \Phi(\gamma_{12} y_1 + x_2 \beta_2) \right) \end{bmatrix} = 0 .
$$ (11-4)


The estimation presents no more difficulty than the usual probit
model. Butler and Mitchell [1990] estimate a simultaneous probit
model of cardiovascular disease and having earnings, each measured
as a discrete outcome. They find, other things equal, that
cardiovascular disease reduces the probability of having earnings and
that having earnings reduces the probability of cardiovascular disease
($\gamma_{21} < 0$ and $\gamma_{12} < 0$). Perhaps because of weak
instruments, they also find that the standard errors rise considerably
relative to estimates assuming all regressors are exogenous; that can
be a problem in any instrumental variable, two-stage least squares, or
GMM estimation.
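

The sample analogue of the orthogonality conditions (11-4) can be sketched in Python as follows. The sketch assumes scalar regressors $x_1$ and $x_2$, uses the instrument vector $(1, x_1, x_2)$ and weights the moments with the identity matrix; these choices, the function names and the data layout are hypothetical simplifications for illustration, not the procedure of Butler and Mitchell [1990].

```python
import math

def Phi(x):
    # Standard normal distribution function.
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def sample_moments(theta, data):
    """Sample analogue of (11-4).

    theta = (g21, g12, b1, b2); data is a list of (y1, y2, x1, x2) tuples.
    Returns six moments: three instruments times two equations.
    """
    g21, g12, b1, b2 = theta
    n = len(data)
    g = [0.0] * 6
    for y1, y2, x1, x2 in data:
        z = (1.0, x1, x2)                      # instruments (assumed choice)
        r1 = y1 - Phi(g21 * y2 + x1 * b1)      # residual of the first equation
        r2 = y2 - Phi(g12 * y1 + x2 * b2)      # residual of the second equation
        for j, zj in enumerate(z):
            g[j] += zj * r1 / n
            g[3 + j] += zj * r2 / n
    return g

def gmm_objective(theta, data):
    # Quadratic form g' W g with W equal to the identity (a first-step choice).
    g = sample_moments(theta, data)
    return sum(gj * gj for gj in g)
```

Minimising `gmm_objective` over `theta` with any numerical optimiser would give a first-step estimate; an efficient second step would replace the identity by an estimate of the inverse covariance matrix of the moments.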


By extending the results in this chapter, earnings could be 
estimated as a continuous outcome. The bivariate ordered probit 
model could also be adapted to this estimation. 


A Hausman test of the specification of the selection method 
is possible, since more efficient estimation of the model is possible 
with the rule, and less efficient but consistent estimation of the 
model is possible without the rule. The Hausman test has a null of 
a joint hypothesis including the specification of the distribution of 
the disturbances. 


11.4 Conclusion and Extensions 


Logically inconsistent models with multiple equilibria consistent 
with observable distributions of data can be estimated (1) by re- 
stricting parameters; (2) by treating nonunique equilibria as obser- 
vationally equivalent, collapsing the potentially overlapping equi- 
libria in all cases, as suggested by Bresnahan and Reiss [1991]; or 
(3) by finding identified parameters with testable implications, es- 
timating the conditional expectations of the endogenous variables. 
This chapter explores the last option, suggesting a GMM estimator
which does not require the selection rule to be specified. 


The most efficient estimation specifies the rule used to find 
equilibria in the regions left ambiguous by the standard limited 
dependent variable models, and uses the rule to specify the likeli- 
hood function. If a proposed rule is available, it forms the basis of a 
Hausman test versus the GMM estimator presented in this chapter. 


The model presented here is a bivariate probit or logit model, 
but the same arguments could be applied to multivariate models, 
ordered probit, logit, or ordered logit models, and simultaneous 
regression and probit, logit, or tobit models. A selection bias model 
in which selection is affected by the outcome of the regression could 
be estimated with the methods of this chapter. 